This documentation describes the software release associated with deliverable D3.1 of the Cloud-TM project, namely the prototype implementation of the Workload and Performance Monitor (WPM). The document presents design/development activities carried out by the project’s partners in the context of WP3 (Task 3.1).
More specifically, the goal of this document is twofold:
It is important to clarify that the design/implementation decisions taken in relation to deliverable D3.1 are not meant to be conclusive and unalterable. Rather, they reflect the results of the activities performed during the first year of the project, which will need to be validated (and possibly revised) as the research and development work progresses during the future phases of the project.
The present deliverable is also related to deliverable D2.1 “Architecture Draft”, where the complete draft of the architecture of the Cloud-TM platform is presented.
In addition, the WPM is also in charge of monitoring the performance parameters that will be selected as representative of the negotiated QoS levels.
The WPM relies on the Lattice framework [1], which offers a conceptually very simple means for instantiating distributed monitoring systems. This framework relies on a small number of interacting components, each one devoted to (and encapsulating) a specific task related to distributed data-gathering activities. In terms of interaction abstraction, the Lattice framework is based on the producer-consumer scheme, where both the producer and the consumer components are, in turn, formed by sub-components, whose instantiation ultimately determines the functionalities of the implemented monitoring system. A producer contains data sources which, in turn, contain one or more probes. Probes read the data values to be monitored, encapsulate measures within measurement messages and put them into message queues. Data values can be read by probes periodically, or as a consequence of some event. A message queue is shared by the data source and the contained probes. When a measurement message is available within some queue, the data source sends it to the consumer, which makes it available to reporter components. Overall, the producer component injects data that are delivered to the consumer. Also, producer and consumer have the capability to interact in order to internally (re)configure their operating mode.
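To make the producer-consumer scheme concrete, the following is a minimal sketch in plain JAVA of a probe enqueuing a measurement message on the queue it shares with its data source, which then delivers it towards the consumer. All class and field names here are illustrative assumptions, not the actual Lattice API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ProducerConsumerSketch {

    // A measurement message: a concise header (source id, timestamp)
    // plus the sampled values.
    static final class MeasurementMessage {
        final String sourceId;
        final long timestamp;
        final double[] values;

        MeasurementMessage(String sourceId, long timestamp, double[] values) {
            this.sourceId = sourceId;
            this.timestamp = timestamp;
            this.values = values;
        }
    }

    // A probe reads a data value and puts it into the message queue it
    // shares with its enclosing data source.
    static final class SimpleProbe {
        private final String id;
        private final BlockingQueue<MeasurementMessage> queue;

        SimpleProbe(String id, BlockingQueue<MeasurementMessage> queue) {
            this.id = id;
            this.queue = queue;
        }

        void sample(double value) {
            queue.offer(new MeasurementMessage(
                    id, System.currentTimeMillis(), new double[] { value }));
        }
    }

    // The data source drains the shared queue and forwards each message to
    // the consumer; here the "transport" is just a direct dequeue.
    static MeasurementMessage deliverOne(BlockingQueue<MeasurementMessage> queue) {
        return queue.poll();
    }

    public static void main(String[] args) {
        BlockingQueue<MeasurementMessage> queue = new LinkedBlockingQueue<>();
        SimpleProbe probe = new SimpleProbe("cpu-probe@vm1", queue);
        probe.sample(0.42);                        // probe side: enqueue a measure
        MeasurementMessage m = deliverOne(queue);  // data-source side: dequeue
        System.out.println(m.sourceId + " -> " + m.values[0]);
    }
}
```

In the real framework the delivery step runs over a configurable transport rather than a direct method call, but the division of responsibilities is the one shown.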
Three logical channels are defined for the interaction between the two components, named:
The data plane is used to transfer data-messages, whose payload is a set of measures, each kept within a proper message-field. The structure of the message (in terms of the number of fields and the meaning of each field) is predetermined. Hence, message-fields do not need to be explicitly tagged, so that only data-values are actually transmitted, together with a concise header tagging the message with very basic information, mostly related to source identification and timestamping. Such a structure can anyway be dynamically reconfigured via interactions supported by the info plane. This is a very relevant feature of Lattice, since it allows a minimal message footprint for (frequently) exchanged data-messages, while still enabling maximal flexibility in terms of on-the-fly (infrequent) reconfiguration of the structure of the monitoring information exchanged across the distributed components of the monitoring architecture.

Finally, the control plane can be used for triggering reconfigurations of the producer component, e.g., by inducing a change of the rate at which measurements need to be taken. Notably, the actual transport mechanism supporting the planes is decoupled from the internal architecture of the producer/consumer components. Specifically, data are disseminated across these components through configurable distribution mechanisms, ranging from IP multicast to publish/subscribe systems, which can be selected on the basis of the actual deployment and which can even be changed over time without affecting other components in terms of their internal configuration. The framework is designed to support multiple producers and multiple consumers, providing the ability to dynamically manage data source configuration, probe activation/deactivation, data sending rate, redundancy and so on.
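The compact data-plane layout described above can be illustrated with a small encoding sketch: the field order and count are assumed to have been agreed via the info plane, so the message carries only a short header plus untagged values. The byte layout below is an illustrative assumption, not the actual Lattice wire format.

```java
import java.nio.ByteBuffer;

public class DataPlaneEncodingSketch {

    // Encode: 4-byte source id, 8-byte timestamp, then the raw untagged values.
    static byte[] encode(int sourceId, long timestamp, double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 8 * values.length);
        buf.putInt(sourceId).putLong(timestamp);
        for (double v : values) buf.putDouble(v);
        return buf.array();
    }

    // Decoding relies on the field count negotiated out-of-band (info plane).
    static double[] decodeValues(byte[] msg, int fieldCount) {
        ByteBuffer buf = ByteBuffer.wrap(msg);
        buf.getInt();   // skip source id
        buf.getLong();  // skip timestamp
        double[] values = new double[fieldCount];
        for (int i = 0; i < fieldCount; i++) values[i] = buf.getDouble();
        return values;
    }

    public static void main(String[] args) {
        byte[] msg = encode(7, System.currentTimeMillis(),
                            new double[] { 0.93, 0.41 }); // e.g. CPU, RAM usage
        double[] decoded = decodeValues(msg, 2);
        System.out.println(decoded[0] + ", " + decoded[1]);
    }
}
```

The point of the sketch is that no per-field tags travel with the message: a structure change only requires both end-points to agree on a new field list via the info plane.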
Thanks to the reliance on JAVA, portability issues are mostly limited to the implementation of the ad-hoc components. As an example, a probe-thread based on direct access to the “proc” file system for gathering CPU/memory usage information is portable only across (virtualized) operating systems supporting that type of file system (e.g. LINUX). However, widening portability across general platforms would only entail reprogramming the internal logic of this probe, which in some cases can even be done by exploiting, e.g., pre-existing JAVA packages providing platform-transparent access to physical resource usage. The aforementioned portability considerations also apply to reporter-threads, which can implement differentiated, portable logics for exposing data to back-end applications (e.g. logics that store the data within a conventional database).
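As an illustration of this point, the following sketch shows a probe routine that would use the LINUX “proc” file system when present and otherwise fall back on the platform-transparent JMX OperatingSystemMXBean; only the probe-internal logic differs between the two paths.

```java
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PortableCpuProbeSketch {

    // Returns a CPU load indicator. On LINUX, /proc/stat could be parsed for
    // per-core jiffies (two samples are needed for a utilization fraction;
    // that parsing is omitted in this sketch). The portable fallback is the
    // JMX 1-minute system load average (negative if unsupported).
    static double readLoad() {
        if (Files.isReadable(Paths.get("/proc/stat"))) {
            // LINUX-specific logic would go here: read and parse /proc/stat.
        }
        return ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }

    public static void main(String[] args) {
        System.out.println("system load: " + readLoad());
    }
}
```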
The SDG functionality maps onto an instantiation of the Lattice framework with Cloud-TM specific probes and collectors. In our instantiation, the elements belonging to the Cloud-TM infrastructure, such as Virtual Machines (VMs), can be logically grouped, and each group entails per-machine probes targeting two types of resources: (A) hardware/virtualized and (B) logical. Statistics for the first kind of resources are collected directly from the Operating System (OS), or via OS-decoupled libraries, while statistics related to logical resources (e.g. the data platform) are collected at the application level by relying on the JMX framework for JAVA components.
Figure 1: WPM Architectural Organization.
The data collected by the probes are sent to the producer component via the facilities natively offered by the Lattice framework. Each producer is coupled with one or more probes and is responsible for managing them. The consumer is the Lattice component that receives the data from the producers, via differentiated messaging implementations, which can be selected on the basis of the specific system deployment. We envisage a LAN-based clustering scheme such that each consumer is in charge of handling one or multiple groups of machines belonging to the same LAN. However, in our architectural organization the number of consumers is not meant to be fixed; rather, it can be scaled up/down depending on the number of instantiated probes/producers. Overall, the consumer can be instantiated as a centralized or a distributed process. Beyond collecting data from the producers, the consumer is also in charge of performing a local elaboration aimed at producing a suited stream representation to be provided as input to the Log Service, which is in turn in charge of supporting the SDL functionality. We have decided to exploit the file system locally available at the consumer side to temporarily keep the stream instances to be sent towards the Log Service. The functional block responsible for the interaction between SDG and SDL is the so-called optimized-transmission service. This can rely on differentiated solutions, depending on whether the instance of SDL is co-located with the consumer or resides on a remote network. Generally speaking, with our organization we can exploit, e.g., SFTP or a locally shared file system. Also, stream compression schemes can be employed to optimize both latency and storage occupancy.

The Log Service is the logical component responsible for storing and managing all the gathered data. It must support queries from the Workload Analyzer so as to expose the statistical data for subsequent processing/analysis.
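As a concrete example of the consumer-side staging just described, the sketch below writes a batch of gathered measures into a gzip-compressed file on the local file system, ready to be picked up by the optimized-transmission service (e.g. pushed via SFTP, or read directly from a shared file system). File naming and record format are illustrative assumptions.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class StreamStagingSketch {

    // Write one line per measure into a gzip-compressed local stream file;
    // compression reduces both transfer latency and storage occupancy.
    static Path stageStream(Path file, List<String> measures) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)))) {
            for (String m : measures) {
                w.write(m);
                w.write('\n');
            }
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("wpm-stream-", ".gz");
        stageStream(tmp, List.of("vm1:CPU:0.42", "vm1:RAM:0.61"));
        System.out.println("staged " + Files.size(tmp) + " bytes at " + tmp);
    }
}
```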
The Log Service could be implemented in several manners, in terms of both the underlying data storage technology and the selected deployment (centralized vs distributed). As for the first aspect, different solutions could be envisaged in order to optimize access operations, depending on, e.g., suitable tradeoffs between performance and access flexibility. This is also related to the data model ultimately supported by the Log Service, which might be a traditional relational model or, alternatively, a <key,value> model. Further, the Log Service could maintain the data on stable storage or within volatile memory, trading reliability for performance. The above aspects are strictly coupled with the functionality/architecture of the Workload Analyzer, which could be implemented as a geographically distributed process in order to better fit the WPM deployment (hence taking advantage of data partitioning and distributed processing).
2]. Infrastructure oriented probes are in charge of gathering statistical data on the usage of:
1) CPU (per core):
3) Network interfaces:
For each of the above four resources, the associated sampling process can be configured with a differentiated timeout, whose value can be selected on the basis of the time granularity at which the sampled statistical process is expected to exhibit non-negligible changes.
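The per-resource configuration described above can be sketched as follows, with a distinct sampling period per probe; the period values below are purely illustrative.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SamplingConfigSketch {

    // Sampling periods in milliseconds, one per monitored resource; each
    // value reflects how fast the sampled process is expected to change.
    static final Map<String, Long> PERIODS = Map.of(
            "cpu", 1_000L,      // CPU usage can change quickly
            "memory", 2_000L,
            "disk", 5_000L,     // disk occupancy typically varies slowly
            "network", 1_000L);

    // Schedule a probe task at the period configured for its resource.
    static void schedule(ScheduledExecutorService pool,
                         String resource, Runnable probe) {
        pool.scheduleAtFixedRate(probe, 0, PERIODS.get(resource),
                                 TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        PERIODS.keySet().forEach(r ->
                schedule(pool, r, () -> System.out.println("sampling " + r)));
        Thread.sleep(100);
        pool.shutdownNow();
    }
}
```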
Currently, we have developed a prototype data-platform probe that accesses the internal audit system of individual Infinispan caches in order to sample the following parameters:
We underline that the above list of parameters, which are currently collected by the JMX client embedded within the data-platform probe prototype, is not meant to exhaustively cover the set of data-platform parameters to be finally monitored within Cloud-TM. Specifically, such a final set will be defined while finalizing the Autonomic Manager architecture, since it will depend on the specific optimization policies supported by the Autonomic Manager. As an example, statistical data related to the layers underlying Infinispan (e.g. the Group Communication layer) might be needed in order to support optimization policies that explicitly take into account latency and scalability aspects across the whole stack of layers forming the Cloud-TM platform. Further, the final set of data-platform parameters to be monitored by WPM will also depend on the specific QoS parameters supported by Cloud-TM, which will be ultimately defined by finalizing the instantiation of the QoS API exposed to the overlying customer applications. Nevertheless, the current structure of the WPM prototype supports the collection of additional parameters via trivial extensions, without impacting the architectural organization described within this document.
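To illustrate the JMX-based collection mechanism used by the data-platform probe, the sketch below builds a standard JMX client connection towards the JMX server exposed by an Infinispan node and reads a single MBean attribute. The host/port values and the ObjectName pattern are assumptions to be adapted to the actual deployment: the standard javax.management API is used, but the MBean names actually registered depend on the Infinispan version and configuration.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class InfinispanJmxProbeSketch {

    // Standard RMI-based JMX service URL for a given host/port.
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Connects to the remote JMX server and reads one attribute of the
    // named MBean (e.g. a cache-statistics counter).
    static Object readStatistic(String host, int port,
                                String mbeanName, String attribute)
            throws Exception {
        JMXServiceURL url = new JMXServiceURL(serviceUrl(host, port));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            return conn.getAttribute(new ObjectName(mbeanName), attribute);
        }
    }

    public static void main(String[] args) {
        // Actually invoking readStatistic requires a reachable Infinispan
        // node with remote JMX enabled, so here we only build the target URL.
        System.out.println("would contact: " + serviceUrl("localhost", 9999));
    }
}
```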
In the design of the WPM prototype, we rely on the Autonomic Manager repository, exploiting it as a registry from which each probe can automatically retrieve information allowing it to univocally tag each measurement message sent to the Lattice consumer with the identity of the corresponding monitored component instance, as currently maintained by the registry. This allows a perfect match between each measurement message and the associated component instance, as seen by the Autonomic Manager at any time instant. Such a process has been supported by embedding within the Lattice probes a sensing functionality, allowing the retrieval of basic information related to the environment where the probe is activated (e.g. the IP address of the VM hosting that instance of the probe), which has been coupled with a matching functionality against the registry in order to both:
Such a behavior is shown in Figure 2, where the interaction with the registry is actuated as a query over specific component types, depending on the type of probe issuing the query (an infrastructure oriented probe will query the registry for extracting records associated with VM instances, while a data platform oriented probe will query the registry for extracting records related to the specific component it is in charge of).
As for point (b), data-platform probes rely on the use of JMX servers exposed by the monitored components. Hence, the information required to correctly support the statistical data-gathering process entails the address (e.g. the port number) associated with the JMX server instance to be contacted. The information associated with point (b) is a “don’t care” for infrastructure oriented probes, since they do not operate via any intermediary entity (e.g. a JMX server).

Given that the registry has not yet been developed, and will be the object of future deliverables, the current prototype of the WPM emulates the registry access via stubs reading name/value records from the file system. The accessed files can be populated according to the specific needs while installing the prototype, as specified in the README.
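The registry stub mentioned above can be as simple as a standard JAVA properties file read from the local file system. The record keys used below (component.id, jmx.port) are illustrative assumptions; the actual ones are specified in the README.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class RegistryStubSketch {

    // Load the name/value records emulating the registry content.
    static Properties load(Path registryFile) throws IOException {
        Properties records = new Properties();
        try (Reader r = Files.newBufferedReader(registryFile)) {
            records.load(r);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("wpm-registry", ".properties");
        Files.writeString(f, "component.id=cache-01\njmx.port=9999\n");
        Properties reg = load(f);
        // An infrastructure probe only needs the component identity; a
        // data-platform probe also needs the JMX address of its component.
        System.out.println(reg.getProperty("component.id")
                + " @ port " + reg.getProperty("jmx.port"));
    }
}
```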
where start and end timestamp values within the file name identify the time interval during which the statistical data have been gathered by the consumer. These timestamp values are determined by exploiting the local clock accessible at the consumer side via the System.currentTimeMillis() service.
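For illustration, the construction of such a file name via the consumer's local clock can be sketched as follows, assuming a hypothetical <start>-<end>.log pattern; the exact format is the one adopted by the prototype.

```java
public class StreamFileNameSketch {

    // Hypothetical naming pattern; the prototype defines the exact format.
    static String streamFileName(long startMillis, long endMillis) {
        return startMillis + "-" + endMillis + ".log";
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // ... the consumer gathers statistical data for this interval ...
        long end = System.currentTimeMillis();
        System.out.println(streamFileName(start, end));
    }
}
```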
where the type_of_measure identifies the specific measure carried out for that component (e.g. CPU vs RAM usage in the case of a VM component), and the value expressed by measure_timestamp is again generated via the local clock accessible by the probe instance producing the message. According to this prototype implementation, the Log Service exposes to the Workload Analyzer the native Infinispan <key,value> API; this does not prevent the possibility of supporting a different API in future releases, depending on the needs associated with the interaction with the Workload Analyzer.
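The <key,value> access pattern can be sketched as follows. Infinispan caches expose a java.util.Map-like API, modelled here with a plain ConcurrentHashMap; the key layout combining component identity, type of measure and timestamp is an illustrative assumption based on the message fields described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LogServiceKeyValueSketch {

    // Illustrative key layout: component id, type of measure, timestamp.
    static String key(String componentId, String typeOfMeasure,
                      long measureTimestamp) {
        return componentId + ":" + typeOfMeasure + ":" + measureTimestamp;
    }

    public static void main(String[] args) {
        Map<String, Double> log = new ConcurrentHashMap<>();
        long ts = System.currentTimeMillis();
        log.put(key("vm-01", "CPU", ts), 0.42);     // store a gathered measure
        // The Workload Analyzer later reads measures back by key.
        System.out.println(log.get(key("vm-01", "CPU", ts)));
    }
}
```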
Workload Monitor Code (tar.bz2)