Discussion:
Resource Management/Locking [was: Re: What would be your ideal solution?]
John Burwell
2013-11-25 15:39:21 UTC
Darren,

I originally presented my thoughts on this subject at CCC13 [1]. Fundamentally, I see CloudStack as having two distinct tiers — orchestration management and automation control. The orchestration tier coordinates the automation control layer to fulfill user goals (e.g. create a VM instance, alter a network route, snapshot a volume, etc) constrained by policies defined by the operator (e.g. multi-tenancy boundaries, ACLs, quotas, etc). This layer must always be available to take new requests, and to report the best available infrastructure state information. Since execution of work is guaranteed once a request is accepted, this layer may queue work to be completed when the appropriate devices become available.

The automation control tier translates logical units of work to underlying infrastructure component APIs. Upon completion of a unit of work’s execution, the state of a device (e.g. hypervisor, storage device, network switch, router, etc) matches the state managed by the orchestration tier at the time the unit of work was created. In order to ensure that the state of the underlying devices remains consistent, these units of work must be executed serially. Permitting concurrent changes to resources creates race conditions that lead to resource overcommitment and state divergence. One symptom of this phenomenon is the myriad of scripts operators write to “synchronize” state between the CloudStack database and their hypervisors. Another is the rapid create-destroy example provided below, which can (and often does) leave dangling resources due to race conditions between the two operations.

In order to provide reliability, CloudStack vertically partitions the infrastructure into zones (independent power source/network uplink combinations) sub-divided into pods (racks). At this time, regions are largely notional and, as such, are not partitions. Between the user’s zone selection and our allocators’ distribution of resources across pods, the system attempts to distribute resources as widely as possible across these partitions to provide resilience against a variety of infrastructure failures (e.g. power loss, network uplink disruption, switch failures, etc). In order to maximize this resilience, the control plane (orchestration + automation tiers) must be able to operate on all available partitions. For example, if we have two (2) zones (A & B) and twenty (20) pods per zone, we should be able to take and execute work in Zone A when one or more of its pods is lost, as well as continue taking and executing work in Zone A when Zone B has failed entirely.

CloudStack is an eventually consistent system in that the state reflected in the orchestration tier will (optimistically) differ from the state of the underlying infrastructure (managed by the automation tier). Furthermore, the system has a partitioning model to provide resilience in the face of a variety of logical and physical failures. However, the automation control tier requires strictly consistent operations. Based on these definitions, the system appears to violate the CAP theorem [2] (Brewer!). The separation of the system into two distinct tiers isolates these characteristics, but the boundary between them must be carefully implemented to ensure that the consistency requirements of the automation tier are not leaked to the orchestration tier.

To properly implement this boundary, I think we should split the orchestration and automation control tiers into separate physical processes communicating via an RPC mechanism — allowing the automation control tier to completely encapsulate its work distribution model. In my mind, the tricky wicket is providing serialization and partition tolerance in the automation control tier. Realistically, there are two options — explicit and implicit locking models. Explicit locking models employ an external coordination mechanism to provide exclusive access to resources (e.g. the RDBMS lock pattern, ZooKeeper, Hazelcast, etc). The challenge with this model is ensuring the availability of the locking mechanism in the face of partition — forcing CloudStack operators to ensure that they have deployed the underlying mechanism in a partition tolerant manner (e.g. don’t locate all of the replicas in the same pod, deploy a cluster per zone, etc). Additionally, the durability introduced by these mechanisms inhibits self-healing due to lock staleness.
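
To make the explicit model concrete, here is a minimal sketch of the RDBMS lock pattern in Java (the vm_instance table, column, and method names here are illustrative assumptions, not CloudStack's actual schema). The row lock serializes all work against the VM across management servers, but only for as long as the database itself remains reachable:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public final class RdbmsLockExample {

        // Hold a row lock on the VM's record for the duration of one
        // transaction, serializing work against that VM across servers.
        public static void withVmLock(Connection conn, long vmId, Runnable work)
                throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id FROM vm_instance WHERE id = ? FOR UPDATE")) {
                ps.setLong(1, vmId);
                ps.executeQuery();   // blocks until the row lock is acquired
                work.run();          // exclusive access while the tx is open
                conn.commit();       // releases the lock
            } catch (SQLException | RuntimeException e) {
                conn.rollback();     // rollback also releases the lock
                throw e;
            }
        }
    }

A ZooKeeper- or Hazelcast-based lock has the same shape; the availability and staleness concerns above apply to all of them.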

In contrast, an implicit locking model structures the runtime execution model itself to provide exclusive access to a resource and to model the partitioning scheme. One such model is to provide a single work queue (mailbox) and consuming process (actor) per resource. The orchestration tier provides a description of the partition and resource definitions to the automation control tier. The automation control tier creates a supervisor per partition, which in turn manages process creation per resource. Therefore, process creation and destruction creates an implicit lock. Since the automation control tier does not persist data in this model, the crash of a supervisor and/or process (supervisors are simply specialized processes) releases the implicit lock, and signals a re-execution of the supervisor/process allocation process. The following high-level process describes creation and allocation, with a minimal code sketch following the list (hand waving certain details such as back pressure and throttling):

1. The automation control layer receives a resource definition (e.g. zone description, VM definition, volume information, etc). These requests are processed by the owning partition supervisor exclusively in order of receipt. Therefore, the automation control tier views the world as a tree of partitions and resources.
2. The partition supervisor creates the process (and the associated mailbox) — providing it with the initial state. The process state is Initialized.
3. The process synchronizes the state of the underlying resource with the state provided. Upon successful completion of state synchronization, the state of the process becomes Ready. Only Ready processes can consume units of work from their mailboxes. If synchronization fails, the process crashes. All state transitions and crashes are reported to interested parties through an asynchronous event reporting mechanism, including the id of the unit of work the device represents.
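
As promised above, here is a minimal plain-Java sketch of the implicit lock. No actor framework is assumed; PartitionSupervisor and the use of one single-threaded executor per resource as the actor/mailbox pair are my illustration, not CloudStack code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public final class PartitionSupervisor {

        // One single-threaded executor per resource: the executor is the
        // actor, its internal queue is the mailbox.
        private final Map<String, ExecutorService> actors =
                new ConcurrentHashMap<>();

        // Units of work for the same resource execute serially, in order of
        // receipt; work for different resources runs concurrently.
        public void submit(String resourceId, Runnable unitOfWork) {
            actors.computeIfAbsent(resourceId,
                    id -> Executors.newSingleThreadExecutor())
                  .execute(unitOfWork);
        }

        // Destroying the actor releases the implicit lock and flushes its
        // pending work, mirroring the crash semantics described above.
        public void crash(String resourceId) {
            ExecutorService actor = actors.remove(resourceId);
            if (actor != null) {
                actor.shutdownNow();   // discards queued units of work
            }
        }
    }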

The Ready state means that the underlying device is in a usable state consistent with the last unit of work executed. A process crashes when it is unable to bring the device into a state consistent with the unit of work being executed (a process crash also destroys the associated mailbox — flushing pending work). This event initiates execution of the allocation process (above) until the process can be re-allocated in a Ready state (again, throttling is hand waved for the purposes of brevity). The state synchronization step converges the actual state of the device with changes that occurred during its unavailability. When a unit of work fails to be executed, the orchestration tier determines the appropriate recovery strategy (e.g. re-allocate the work to another resource, wait for the availability of the resource, fail the operation, etc).

The association of one process per resource provides exclusive access to the resource without the requirement of an external locking mechanism. A mailbox per process orders pending units of work. Together, they provide serialization of operation execution. In the example provided, a unit of work would be submitted to create a VM and a second unit of work would be submitted to destroy it. The creation would be completely executed, followed by the destruction (assuming no failures). Therefore, the VM will briefly exist before being destroyed. In conjunction with a process location mechanism, the system can place the processes associated with resources in the appropriate partition, allowing the system to properly self-heal, manage its own scalability (think lightweight system VMs), and systematically enforce partition tolerance (the operator was nice enough to describe their infrastructure — we should use it to ensure resilience of CloudStack and their infrastructure).
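
Using the sketch above, Darren's rapid create-destroy scenario becomes two ordered entries in the same mailbox (the work bodies here are placeholders, not the real unit-of-work types):

    PartitionSupervisor pod = new PartitionSupervisor();
    // Both units of work land in vm-1's mailbox and run in order of receipt:
    // the create fully completes before the destroy begins.
    pod.submit("vm-1", () -> System.out.println("create vm-1"));
    pod.submit("vm-1", () -> System.out.println("destroy vm-1"));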

Until relatively recently, the implicit locking model described was infeasible on the JVM. Using native Java threads, a server would be limited to controlling (at best) a few hundred resources. However, lightweight threading models implemented by libraries/frameworks such as Akka [3], Quasar [4], and Erjang [5] can scale to millions of “threads” on reasonably sized servers and provide the supervisor/actor/mailbox abstractions described above. Most importantly, this approach does not require operators to become operationally knowledgeable about yet another platform/component. In short, I believe we can encapsulate these requirements in the management server (orchestration + automation control tiers) — keeping the operational footprint of the system proportional to the deployment without sacrificing resilience. Finally, it provides the foundation for proper collection of instrumentation information and process control/monitoring across data centers.

Admittedly, I have hand waved some significant issues that would need to be resolved. I believe they are all resolvable, but it would take discussion to determine the best approach to them. Transforming CloudStack to such a model would not be trivial, but I believe it would be worth the (significant) effort, as it would make CloudStack one of the most scalable and resilient cloud orchestration/management platforms available.

Thanks,
-John

[1]: http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
[2]: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
[3]: http://akka.io
[4]: https://github.com/puniverse/quasar
[5]: https://github.com/trifork/erjang/wiki

P.S. I have CC’ed the developer mailing list. All conversations at this level of detail should be initiated and occur on the mailing list to ensure transparency with the community.
140 characters are not productive.
What would be your ideal way to do distributed concurrency control? Simple use case. Server 1 receives a request to start a VM 1. Server 2 receives a request to delete VM 1. What do you do?
Darren
Darren Shepherd
2013-11-25 17:16:49 UTC
You bring up some interesting points. I really need to digest this
further. From a high level I think I agree, but there are a lot of implied
details in what you've said.

Darren
Darren Shepherd
2013-11-25 17:29:22 UTC
I will ask one basic question: how do you foresee managing one mailbox per
resource? If I have multiple servers running in an active-active mode, how
do you determine which server has the mailbox? Do you create actors on
demand? How do you synchronize that operation?

Darren


John Burwell
2013-11-25 19:10:06 UTC
Darren,

In a peer-to-peer model such as I describe, active-active both is and is not a concept. The supervision tree is responsible for identifying failure and initiating process re-allocation for failed resources. For example, if a pod’s management process crashed, it would also crash all of the processes managing the hosts in that pod. The zone would then attempt to restart the pod’s management process (either local to the zone supervisor or on a remote instance — which could be configurable) until it was able to start “Ready” processes for the child resources.
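
As a rough plain-Java illustration of that restart behaviour (ZoneSupervisor and the method names are mine, and back-off/throttling is omitted here just as it is hand waved above):

    import java.util.concurrent.CompletableFuture;

    public final class ZoneSupervisor {

        // Run a pod's management process asynchronously; if it crashes, its
        // child processes die with it, and re-running the task re-executes
        // the allocation process until Ready processes exist for the pod's
        // hosts.
        public void supervisePod(String podId, Runnable podManagementProcess) {
            CompletableFuture
                .runAsync(podManagementProcess)
                .whenComplete((ok, failure) -> {
                    if (failure != null) {
                        supervisePod(podId, podManagementProcess); // restart
                    }
                });
        }
    }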

This model requires a “special” root supervisor that is controlled by the orchestration tier, which can identify when a zone supervisor becomes unavailable and attempt to restart it. The ownership of this “special” supervisor will require a consensus mechanism amongst the orchestration tier processes to elect an owner of the process and to determine when a new owner needs to be elected (e.g. a Raft implementation such as barge [1]). Given that the orchestration tier is designed as an AP system, any orchestration tier process should be able to be an owner (i.e. the operator is not required to identify a “master” node). There are likely other potential topologies (e.g. a root supervisor per zone rather than one for all zones), but in all cases ownership election would be the same. Most importantly, there are no data durability requirements in this claim model. When an orchestration process becomes unable to continue owning a root supervisor, the other orchestration processes recognize the missing owner and initiate an ownership claim for the partition’s root supervisor.
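
The shape of that ownership claim might look something like the following purely hypothetical interface (the names are mine; barge's actual API is not shown):

    // Each orchestration process runs this lifecycle; the consensus layer
    // (e.g. a Raft implementation) guarantees at most one owner at a time.
    interface RootSupervisorElection {
        void start(String processId);                 // join the election
        boolean isOwner();                            // true on elected owner
        void onElected(Runnable startRootSupervisor); // claim succeeded
        void onRevoked(Runnable stopRootSupervisor);  // ownership lost; some
                                                      // other process re-claims
    }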

In all failure scenarios, the supervision tree must be rebuilt from the point of failure downward using the process allocation procedure I previously described. For an initial implementation, I would recommend simply throwing away any parts of the supervision tree that are still running in the event of a widespread failure (e.g. a zone with many pods). Depending on recovery time and SLAs, a future optimization may be to re-attach “orphaned” branches of the previous tree to the tree being built as part of the recovery process (e.g. loss of a zone supervisor due to a switch failure). Additionally, the system would also need a mechanism to hand off ownership of the root supervisor for planned outages (hardware upgrades/decommissioning, maintenance windows, etc).

Again, caveated with a few hand waves, the idea is to build up a peer-to-peer management model that provides strict serialization guarantees. Fundamentally, it utilizes a tree of processes to provide exclusive access, distribute work, and ensure availability requirements when partitions occur. Details would need to be worked out for the best application to CloudStack (e.g. root node ownership and orchestration tier gossip), but we would be implementing well-trod distributed systems concepts in the context of cloud orchestration (sounds like a fun thing to do …).

Thanks,
-John

[1]: https://github.com/mgodave/barge

P.S. I see the libraries/frameworks referenced as building blocks for a solution, but none of them (alone or in combination) solves the problem completely.
Darren Shepherd
2013-11-25 21:18:12 UTC
Okay, I'll have to stew over this for a bit. My one general comment is
that it seems complicated. Such a system seems like it would take a good
amount of effort to construct properly, and as such it's a risky endeavour.

Darren
Post by John Burwell
Darren,
In a peer-to-peer model such as I describe, active-active is and is not a
concept. The supervision tree is responsible for identifying failure, and
initiating process re-allocation for failed resources. For example, if a
pod’s management process crashed, it would also crash all of the processes
managing the hosts in that pod. The zone would then attempt to restart the
pod’s management process (either local to the zone supervisor or on a
remote instance which could be configurable) until it was able to start
“ready” process for the child resource.
This model requires a “special” root supervisor that is controlled by the
orchestration tier which can identify when a zone supervisor becomes
unavailable, and attempts to restart it. The ownership of this “special”
supervisor will require a consensus mechanism amongst the orchestration
tier processes to elect an owner of the process and determine when a new
owner needs to be elected (e.g. a Raft implementation such as barge [1]).
Given the orchestration tier is designed as an AP system, an orchestration
tier process should be able to be an owner (i.e. the operator is not
required to identify a “master” node). There are likely other potential
topologies (e.g. a root supervisor per zone rather than one for all zones),
but in all cases ownership election would be the same. Most importantly,
there are no data durability requirements in this claim model. When an
orchestration process becomes unable to continue owning a root supervisor,
the other orchestration processes recognize the missing owner and initiate
ownership claim the process for the partition.
In all failure scenarios, the supervision tree must be rebuilt from the
point of failure downward using the process allocation process I previously
described. For an initial implementation, I would recommend taking simply
throwing any parts of the supervision tree that are already running in the
event of a widespread failure (e.g. a zone with many pods). Dependent on
the recovery time and SLAs, a future optimization may be to re-attach
“orphaned” branches of the previous tree to the tree being built as part of
the recovery process (e.g. loss a zone supervisor due to a switch failure).
Additionally, the system would also need a mechanism to hand-off ownership
of the root supervisor for planned outages (hardware
upgrades/decommissioning, maintenance windows, etc).
Again, caveated with a a few hand waves, the idea is to build up a
peer-to-peer management model that provides strict serialization
guarantees. Fundamentally, it utilizes a tree of processes to provide
exclusive access, distribute work, and ensure availability requirements
when partitions occur. Details would need to be worked out for the best
application to CloudStack (e.g root node ownership and orchestration tier
gossip), but we would be implementing well-trod distributed systems
concepts in the context cloud orchestration (sounds like a fun thing to do
…).
Thanks,
-John
[1]: https://github.com/mgodave/barge
P.S. I see the libraries/frameworks referenced as the building blocks to a
solution, but none of them (in whole or combination) solves the problem
completely.
I will ask one basic question. How do you forsee managing one mailbox per
resource. If I have multiple servers running in an active-active mode, how
do you determine which server has the mailbox? Do you create actors on
demand? How do you synchronize that operation?
Darren
On Mon, Nov 25, 2013 at 10:16 AM, Darren Shepherd <
Post by Darren Shepherd
You bring up some interesting points. I really need to digest this
further. From a high level I think I agree, but there are a lot of implied
details of what you've said.
Darren
Post by John Burwell
Darren,
I originally presented my thoughts on this subject at CCC13 [1].
Fundamentally, I see CloudStack as having two distinct tiers —
orchestration management and automation control. The orchestration tier
coordinates the automation control layer to fulfill user goals (e.g. create
a VM instance, alter a network route, snapshot a volume, etc) constrained
by policies defined by the operator (e.g. multi-tenacy boundaries, ACLs,
quotas, etc). This layer must always be available to take new requests,
and to report the best available infrastructure state information. Since
execution of work is guaranteed on completion of a request, this layer may
pend work to be completed when the appropriate devices become available.
The automation control tier translates logical units of work to
underlying infrastructure component APIs. Upon completion of unit of
work’s execution, the state of a device (e.g. hypervisor, storage device,
network switch, router, etc) matches the state managed by the orchestration
tier at the time unit of work was created. In order to ensure that the
state of the underlying devices remains consistent, these units of work
must be executed serially. Permitting concurrent changes to resources
creates race conditions that lead to resource overcommitment and state
divergence. A symptom of this phenomenon are the myriad of scripts
operators write to “synchronize” state between the CloudStack database and
their hypervisors. Another is the example provided below is the rapid
create-destroy which can (and often does) leave dangling resources due to
race conditions between the two operations.
In order to provide reliability, CloudStack vertically partitions the
infrastructure into zones (independent power source/network uplink
combination) sub-divided into pods (racks). At this time, regions are
largely notional, as such, as are not partitions at this time. Between the
user’s zone selection and our allocators distribution of resources across
pods, the system attempts to distribute resources widely as possible across
these partitions to provide resilience against a variety infrastructure
failures (e.g. power loss, network uplink disruption, switch failures,
etc). In order maximize this resilience, the control plane (orchestration
+ automation tiers) must be to operate on all available partitions. For
example, if we have two (2) zones (A & B) and twenty (20) pods per zone, we
should be able to take and execute work in Zone A when one or more pods is
lost, as well as, when taking and executing work in Zone B when Zone B has
failed.
CloudStack is an eventually consistent system in that the state
reflected in the orchestration tier will (optimistically) differ from the
state of the underlying infrastructure (managed by the automation tier).
Furthermore, the system has a partitioning model to provide resilience in
the face of a variety of logical and physical failures. However, the
automation control tier requires strictly consistent operations. Based on
these definitions, the system appears to violate the CAP theorem [2]
(Brewer!). The separation of the system into two distinct tiers isolates
these characteristics, but the boundary between them must be carefully
implemented to ensure that the consistency requirements of the automation
tier are not leaked to the orchestration tier.
To properly implement this boundary, I think we should split the
orchestration and automation control tiers into separate physical processes
communicating via an RPC mechanism — allowing the automation control tier
to completely encapsulate its work distribution model. In my mind, the
tricky wicket is providing serialization and partition tolerance in the
automation control tier. Realistically, there two options — explicit and
implicit locking models. Explicit locking models employ an external
coordination mechanism to coordinate exclusive access to resources (e.g.
RDBMS lock pattern, ZooKeeper, Hazelcast, etc). The challenge with this
model is ensuring the availability of the locking mechanism in the face of
partition — forcing CloudStack operators to ensure that they have deployed
the underlying mechanism in a partition tolerant manner (e.g. don’t locate
all of the replicas in the same pod, deploy a cluster per zone, etc).
Additionally, the durability introduced by these mechanisms inhibits the
self-healing due to lock staleness.
In contrast, an implicit lock model structures the runtime execution
model to provide exclusive access to a resource and model the partitioning
scheme. One such model is to provide a single work queue (mailbox) and
consuming process (actor) per resource. The orchestration tier provides a
description of the partition and resource definitions to the automation
control tier. The automation control tier creates a supervisor per
partition which in turn manage process creation per resource. Therefore,
process creation and destruction creates an implicit lock. Since
automation control tier does not persist data in this model, The crash of
a supervisor and/or process (supervisors are simply specialized processes)
releases the implicit lock, and signals a re-execution of the
supervisor/process allocation process. The following high-level process
describes creation allocation (hand waves certain details such as back
1. The automation control layer receives a resource definition (e.g.
zone description, VM definition, volume information, etc). These requests
are processed by the owning partition supervisor exclusively in order of
receipt. Therefore, the automation control tier views the world as a tree
of partitions and resources.
2. The partition supervisor creates the process (and the associated
mailbox) — providing it with the initial state. The process state is
Initialized.
3. The process synchronizes the state of the underlying resource
with the state provided. Upon successful completion of state
synchronization, the state of the process becomes Ready. Only Ready
processes can consume units of work from their mailboxes. The processes
crashes. All state transitions and crashes are reported to interested
parties through an asynchronous event reporting mechanism including the id
of the unit of work the device represents.
The Ready state means that the underlying device is in a useable state
consistent with the last unit of work executed. A process crashes when it
is unable to bring the device into a state consistent with the unit of work
being executed (a process crash also destroys the associated mailbox —
flushing pending work). This event initiates execution of allocation
process (above) until the process can be re-allocated in a Ready state
(again throttling is hand waved for the purposes of brevity). The state
synchronization step converges the actual state of the device with changes
that occurred during unavailability. When a unit of work fails to be
executed, the orchestration tier determines the appropriate recovery
strategy (e.g. re-allocate work to another resource, wait for the
availability of the resource, fail the operation, etc).
The association of one process per resource provides exclusive access to
the resource without the requirement of an external locking mechanism. A
mailbox per process provides orders pending units of work. Together, they
provide serialization of operation execution. In the example provided, a
unit of work would be submitted to create a VM and a second unit of work
would be submitted to destroy it. The creation would be completely
executed followed by the destruction (assuming no failures). Therefore,
the VM will briefly exist before being destroyed. In conduction with a
process location mechanism, the system can place the processes associated
with resources in the appropriate partition allowing the system properly
self heal, manage its own scalability (thinking lightweight system VMs),
and systematically enforce partition tolerance (the operator was nice
enough to describe their infrastructure — we should use it to ensure
resilience of CloudStack and their infrastructure).
Until relatively recently, the implicit locking model described was
infeasible on the JVM. Using native Java threads, a server would be
limited to controlling (at best) a few hundred resources. However,
lightweight threading models implemented by libraries/frameworks such as
Akka [3], Quasar [4], and Erjang [5] can scale to millions of “threads” on
reasonably sized servers and provide the supervisor/actor/mailbox
abstractions described above. Most importantly, this approach does not
require operators to become operationally knowledgeable of yet another
platform/component. In short, I believe we can encapsulate these
requirements in the management server (orchestration + automation control
tiers) — keeping the operational footprint of the system proportional to
the deployment without sacrificing resilience. Finally, it provides the
foundation for proper collection of instrumentation information and process
control/monitoring across data centers.
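As a rough sketch of that shape using Akka's classic Java API (the actor
and message names here are illustrative only), one actor per resource
gives the mailbox and serial consumption for free:

    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.actor.UntypedActor;

    // One lightweight actor per resource: its mailbox orders units of work,
    // and single-threaded message processing provides the implicit lock.
    public class ResourceActor extends UntypedActor {
        @Override
        public void onReceive(Object unitOfWork) throws Exception {
            if (unitOfWork instanceof String) {
                // translate the unit of work to the underlying device API here
                System.out.println(getSelf().path().name() + " executing: " + unitOfWork);
            } else {
                unhandled(unitOfWork);
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("automation-control");
            // Cheap enough to allocate one actor per VM, host, volume, switch, ...
            ActorRef vm1 = system.actorOf(Props.create(ResourceActor.class), "vm-1");
            vm1.tell("create", ActorRef.noSender());
            vm1.tell("destroy", ActorRef.noSender());
        }
    }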
Admittedly, I have hand waved some significant issues that would need to
be resolved. I believe they are all resolvable, but it would take
discussion to determine the best approach to them. Transforming CloudStack
to such a model would not be trivial, but I believe it would be worth the
(significant) effort, as it would make CloudStack one of the most scalable
and resilient cloud orchestration/management platforms available.
Thanks,
-John
[1]: http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
[2]: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
[3]: http://akka.io
[4]: https://github.com/puniverse/quasar
[5]: https://github.com/trifork/erjang/wiki
P.S. I have CC’ed the developer mailing list. All conversations at
this level of detail should be initiated and occur on the mailing list to
ensure transparency with the community.
On Nov 22, 2013, at 3:49 PM, Darren Shepherd <
140 characters are not productive.
What would be your ideal way to do distributed concurrency control?
Simple use case: Server 1 receives a request to start VM 1. Server 2
receives a request to delete VM 1. What do you do?
Darren
Pierre-Yves Ritschard
2013-11-26 08:52:50 UTC
Permalink
Hi Everyone,

First off, I'm really excited that there is an ongoing discussion on
these issues.

I agree with John that CAP provides a good "framework" for looking at the
individual properties of the distributed system that CloudStack is, as a
whole. The separation between an orchestration layer and an automation
layer is also a valid abstraction of the main roles of the management
server.

As far as CAP properties are concerned, I don't think there is much
question that the aim is for:

* a CP orchestration layer (it will continue to rely on a CP system: an
RDBMS)
* an AP automation layer (it is tied to an AP system, a cluster of
hypervisors)

As far as operations are concerned, I think the plugin approach in CS is
great: it allows shipping a very simple system to start with, where a
single management server will most likely run. In large distributed
systems it is certainly not a crazy requirement to rely on ZooKeeper; in
many shops using CS, ZK is already used anyhow, and operation-wise it is
not more complex than, say, maintaining a highly available MySQL cluster.

Before I go on, I'll just acknowledge here that I'm not addressing the
issue of compatibility. All approaches discussed so far, except Darren's,
do not concern themselves with compatibility and upgrades, which will be
a major pain if the persistence layer / data store evolves in any
significant way. I know this is a big concern for CS users and Citrix,
and it will need to be taken into account; I don't have a clear picture
of how this could be done.

As far as persistence is concerned, there are different things that CS
stores which have different requirements:

* Organizational data needs strong consistency: users, accounts, domains,
projects, configuration (for networks, templates, ...)
* Transient resource data (VM running status) can only have eventual
consistency
* Usage data only requires eventual consistency (and does not need to
clutter the main data store)

I think one of the reasons for the head-scratching around resources right
now is that the persistence layer is currently used both for storing the
expected state of resources and their actual state; maybe there should be
a transient persistence layer used for storing known states.

So to sum up, as far as storage is concerned, it might be easier to
reason about CS in terms of three different persistence layers (a
possible shape is sketched after this list):

* A main layer for organizational data, expected state and last known state
* A layer for storing state as reported by resource owners (hypervisors)
* A mechanism for distributing usage data
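One purely illustrative shape for those three layers in Java; none of
these interfaces exist in CS today:

    import java.util.Map;

    // Strongly consistent store for organizational data and expected state
    // (today, the RDBMS).
    interface OrganizationalStore {
        void saveExpectedState(String resourceId, String expectedState);
        String loadExpectedState(String resourceId);
    }

    // Transient, eventually consistent store for states reported by
    // resource owners (hypervisors).
    interface ReportedStateStore {
        void report(String resourceId, String observedState);
        Map<String, String> snapshot(); // last known states, possibly stale
    }

    // Fire-and-forget distribution channel for usage records.
    interface UsageSink {
        void emit(String accountId, String metric, double value);
    }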
With such a system, the mailbox approach is possible. I do think that the
amount of work in CS would be huge, and that we would risk ending up with
a franken-Erlang type system, which Java doesn't lend itself well to
(surely Scala could, but this would imply a total rewrite).

An intermediate step could be to look at resources the same way Apache
Kafka does (or, in a way, Apache Cassandra). Managers could be seen as a
homogeneous cluster, each responsible for an nth of the resources (for a
cluster of n managers). A good mechanism is needed for agreeing on
cluster membership, but there are several proven and valid approaches for
this (and it's a problem that lends itself well to the plugin approach in
CS).

A typical incoming API request would thus hit any management node, which
could either issue a redirect to the correct node, proxy it to the
correct node, or create a jobid and let the client query the jobid for
its status (see the sketch below).
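A minimal sketch of that routing decision, assuming n management nodes
and a stable hash of the resource id (names invented; real cluster
membership would come from ZK or similar):

    import java.util.List;

    class RequestRouter {
        private final List<String> managers; // agreed-upon membership, e.g. from ZooKeeper
        private final String self;

        RequestRouter(List<String> managers, String self) {
            this.managers = managers;
            this.self = self;
        }

        // Each manager owns an nth of the resources, Kafka-partition style.
        String ownerOf(String resourceId) {
            int idx = Math.floorMod(resourceId.hashCode(), managers.size());
            return managers.get(idx);
        }

        // An API request may land on any node; it is handled locally, or
        // redirected/proxied to the owner (or turned into a polled jobid).
        boolean handleLocally(String resourceId) {
            return self.equals(ownerOf(resourceId));
        }
    }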
The upside of this approach is that it still makes it possible for CS to
become the Jenkins of cloud controllers (it would need an HSQLDB option
for persistence, though!) and to rely on proven and well understood
projects for larger deployments (like ZK or, when it stabilizes, an
implementation of Raft).

A first step towards this would be to have some sort of agreement on the
different layers of persistence needed throughout CS and try to move
forward. I can get my hands dirty and try to evolve the Dao stuff that is
everywhere in CS, but I'd like to know I'm not going towards a dead-end.
On Mon, Nov 25, 2013 at 10:18 PM, Darren Shepherd <
Post by Darren Shepherd
Okay, I'll have to stew over this for a bit. My one general comment is
that it seems complicated. Such a system seems like it would take a good
amount of effort to construct properly, and as such it's a risky endeavour.
Darren
Post by John Burwell
Darren,
In a peer-to-peer model such as I describe, active-active is and is not a
concept. The supervision tree is responsible for identifying failure, and
initiating process re-allocation for failed resources. For example, if a
pod’s management process crashed, it would also crash all of the processes
managing the hosts in that pod. The zone would then attempt to restart the
pod’s management process (either local to the zone supervisor or on a
remote instance, which could be configurable) until it was able to start a
“ready” process for the child resource.
This model requires a “special” root supervisor that is controlled by the
orchestration tier, which can identify when a zone supervisor becomes
unavailable and attempts to restart it. The ownership of this “special”
supervisor will require a consensus mechanism amongst the orchestration
tier processes to elect an owner of the process and determine when a new
owner needs to be elected (e.g. a Raft implementation such as barge [1]).
Given the orchestration tier is designed as an AP system, an orchestration
tier process should be able to be an owner (i.e. the operator is not
required to identify a “master” node). There are likely other potential
topologies (e.g. a root supervisor per zone rather than one for all zones),
but in all cases ownership election would be the same. Most importantly,
there are no data durability requirements in this claim model. When an
orchestration process becomes unable to continue owning a root supervisor,
the other orchestration processes recognize the missing owner and initiate
an ownership claim on the process for the partition.
In all failure scenarios, the supervision tree must be rebuilt from the
point of failure downward using the process allocation process I
previously described. For an initial implementation, I would recommend
simply throwing away any parts of the supervision tree that are already
running in the event of a widespread failure (e.g. a zone with many pods).
Dependent on the recovery time and SLAs, a future optimization may be to
re-attach “orphaned” branches of the previous tree to the tree being built
as part of the recovery process (e.g. loss of a zone supervisor due to a
switch failure). Additionally, the system would also need a mechanism to
hand off ownership of the root supervisor for planned outages (hardware
upgrades/decommissioning, maintenance windows, etc).
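A hypothetical sketch of that throw-away-and-rebuild recovery (all types
and names invented for illustration):

    import java.util.ArrayList;
    import java.util.List;

    // A node in the supervision tree: a partition (zone, pod) or a resource process.
    class SupervisedNode {
        final String name;
        final List<SupervisedNode> children = new ArrayList<>();

        SupervisedNode(String name) { this.name = name; }

        // Initial-implementation recovery: discard everything below the
        // failure point and re-run the allocation process from there downward.
        void rebuildFrom() {
            stopSubtree();
            allocate();
        }

        private void stopSubtree() {
            for (SupervisedNode child : children) {
                child.stopSubtree(); // crash of a supervisor crashes all of its processes
            }
            children.clear();
            System.out.println("stopped " + name);
        }

        private void allocate() {
            System.out.println("re-allocating children of " + name);
            // ... create child processes, each synchronizing its device
            // state until it reaches Ready ...
        }
    }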
Again, caveated with a few hand waves, the idea is to build up a
peer-to-peer management model that provides strict serialization
guarantees. Fundamentally, it utilizes a tree of processes to provide
exclusive access, distribute work, and ensure availability requirements
when partitions occur. Details would need to be worked out for the best
application to CloudStack (e.g. root node ownership and orchestration tier
gossip), but we would be implementing well-trod distributed systems
concepts in the context of cloud orchestration (sounds like a fun thing to
do …).
Thanks,
-John
[1]: https://github.com/mgodave/barge
P.S. I see the libraries/frameworks referenced as the building blocks to a
solution, but none of them (in whole or combination) solves the problem
completely.
On Nov 25, 2013, at 12:29 PM, Darren Shepherd <
I will ask one basic question. How do you foresee managing one mailbox per
resource? If I have multiple servers running in an active-active mode, how
do you determine which server has the mailbox? Do you create actors on
demand? How do you synchronize that operation?
Darren
On Mon, Nov 25, 2013 at 10:16 AM, Darren Shepherd <
Post by Darren Shepherd
You bring up some interesting points. I really need to digest this
further. From a high level I think I agree, but there are a lot of
implied details of what you've said.
Darren
Edison Su
2013-11-25 19:05:09 UTC
Permalink
Won't the architecture used by Mesos/Omega solve the resource management/locking issue?
http://mesos.apache.org/documentation/latest/mesos-architecture/
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Basically, one server holds all the resource information in memory (cpu/memory/disk/ip address, etc) about the whole data center; all the hypervisor hosts and any other resource entities connect to this server to report/update their own resources. As there is only one master server, the CAP theorem does not come into play.
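For concreteness, the single-master model amounts to an in-memory
registry that hosts continually report into; a toy Java sketch with
invented names (not the Mesos API):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy single-master resource registry: the entire data center's
    // resource picture lives in one process's memory.
    class ResourceMaster {
        private final Map<String, String> hostResources = new ConcurrentHashMap<>();

        // Hypervisor hosts connect and report/update their own resources.
        void report(String hostId, String resources) {
            hostResources.put(hostId, resources);
        }

        // Allocation decisions read the single authoritative in-memory view.
        String resourcesOf(String hostId) {
            return hostResources.get(hostId);
        }
    }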
John Burwell
2013-11-25 19:21:37 UTC
Permalink
Edison,

The CAP theorem applies to all distributed systems. One “master” controlling a bunch of hypervisors, directed by an orchestration engine + ZooKeeper, is a distributed system; in this case, a consistent one. In my very brief reading of it, CloudStack would need multiple Mesos masters to provide availability in the event of zone or pod failures. It would run into the same explicit locking issues I previously described — ensuring the underlying ZooKeeper infrastructure can maintain quorum in the face of zone and/or pod failures. While it is possible to achieve, it would greatly increase the complexity of CloudStack deployments.

Thanks,
-John