32 results for Complex Adaptive Systems
Abstract:
We consider the problem of optimizing the workforce of a service system. Adapting the staffing levels in such systems is non-trivial because workload varies widely and the large number of system parameters rules out a brute-force search. Further, because these parameters change on a weekly basis, the optimization should not take longer than a few hours. Our aim is to find the optimum staffing levels from a discrete high-dimensional parameter set that minimize the long-run average of the single-stage cost function while adhering to constraints on queue stability and service-level agreement (SLA) compliance. The single-stage cost function balances the conflicting objectives of utilizing workers better and attaining the target SLAs. We formulate this problem as a constrained Markov cost process parameterized by the (discrete) staffing levels. We propose novel simultaneous perturbation stochastic approximation (SPSA)-based algorithms for solving this problem. The algorithms include both first-order and second-order methods and incorporate SPSA-based gradient/Hessian estimates for primal descent, while performing dual ascent for the Lagrange multipliers. Both algorithms are online and update the staffing levels incrementally. Further, they involve a certain generalized smooth projection operator, which is essential to project the continuous-valued worker parameter tuned by our algorithms onto the discrete set. The smoothness is necessary to ensure that the underlying transition dynamics of the constrained Markov cost process are themselves smooth (as a function of the continuous-valued parameter), a critical requirement for proving the convergence of both algorithms. We validate our algorithms via performance simulations based on data from five real-life service systems. For the sake of comparison, we also implement a scatter-search-based algorithm using the state-of-the-art optimization toolkit OptQuest.
From the experiments, we observe that both our algorithms converge empirically and outperform OptQuest in most of the settings considered. This finding, coupled with the computational advantage of our algorithms, makes them amenable to adaptive labor staffing in real-life service systems.
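To make the SPSA idea in this abstract concrete, here is a minimal first-order sketch in Python. It is not the authors' algorithm: it assumes a noise-free toy cost, uses a plain box clip and a final rounding step in place of the generalized smooth projection operator, and leaves out the Lagrange-multiplier (dual ascent) updates. The function names, gain constants, and the quadratic cost in the usage note are all hypothetical.

```python
import random


def spsa_gradient(cost, theta, c_k, rng):
    """Estimate the gradient of `cost` at `theta` from two cost evaluations."""
    # Perturb every coordinate simultaneously with a random +/-1 direction.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c_k * d for t, d in zip(theta, delta)]
    minus = [t - c_k * d for t, d in zip(theta, delta)]
    diff = cost(plus) - cost(minus)
    # The same cost difference is reused for every coordinate.
    return [diff / (2.0 * c_k * d) for d in delta]


def clip(theta, lo, hi):
    """Plain box projection (a stand-in, NOT the paper's smooth projection)."""
    return [min(max(t, lo), hi) for t in theta]


def spsa_minimize(cost, theta0, lo, hi, iterations=5000, seed=0):
    """First-order SPSA descent on a continuous relaxation of the parameter."""
    rng = random.Random(seed)
    theta = list(theta0)
    for k in range(1, iterations + 1):
        # Decaying gains; the constants here are illustrative assumptions.
        a_k = 0.1 / (k + 10) ** 0.602
        c_k = 0.5 / k ** 0.101
        grad = spsa_gradient(cost, theta, c_k, rng)
        theta = clip([t - a_k * g for t, g in zip(theta, grad)], lo, hi)
    return theta
```

For example, with a hypothetical cost `lambda th: sum((t - g) ** 2 for t, g in zip(th, [3.0, 7.0, 5.0]))` and a start of `[10.0, 10.0, 10.0]` inside the box `[1, 20]`, the continuous iterate approaches the minimizer, and rounding each coordinate gives integer staffing levels. The point of the sketch is that each iteration needs only two cost evaluations regardless of the number of parameters, which is what makes the approach attractive when a brute-force search is infeasible.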
Abstract:
Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in low efficiency because the high number of application failures causes a large amount of work to be lost to rollbacks. In such scenarios, proactive fault tolerance mechanisms that can help avoid a significant number of failures are essential. In this work, we have developed a mechanism for proactive fault tolerance using partial replication of a set of application processes. Our fault tolerance framework periodically adapts the set of replicated processes based on failure predictions to avoid failures. We have developed an MPI prototype implementation, PAREP-MPI, that allows changing the replica set. We have shown that our strategy involving adaptive process replication significantly outperforms existing mechanisms, providing up to 20 percent improvement in application efficiency even for exascale systems.
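The claim that periodic checkpointing alone becomes inefficient at low MTBF can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the paper: it assumes Young's first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF), ignores restart time, and is only meaningful when the checkpoint cost is much smaller than the MTBF. The function names and the numbers in the usage note are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order checkpoint interval (same time unit as the inputs)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def approx_efficiency(checkpoint_cost, mtbf):
    """Rough fraction of time spent on useful work under periodic checkpointing."""
    tau = young_interval(checkpoint_cost, mtbf)
    # Overhead of writing checkpoints plus expected rework after a failure
    # (on average half an interval is lost); restart time is ignored.
    waste = checkpoint_cost / tau + tau / (2.0 * mtbf)
    return max(0.0, 1.0 - waste)
```

With a hypothetical 5-minute checkpoint cost, `approx_efficiency(5.0, 1440.0)` (MTBF of one day) is noticeably higher than `approx_efficiency(5.0, 60.0)` (MTBF of one hour), which is the qualitative effect the abstract describes and the motivation for avoiding failures proactively rather than only recovering from them.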