Main.ArchitectureAndFrameworks r1.8 - 10 Feb 2005 - 23:52 - DougTody

Future Astronomical Software Environments
Design Concepts and Architecture

Introduction

Much astronomical data analysis software in use today is found in systems which are 10-25 years old. The technology used in these systems is outdated and does not make effective use of the wealth of software developed outside astronomy over the past decade or more. These older systems have a largely closed architecture with only limited interoperability and software sharing between systems. Their architecture was not designed for modern astronomical data processing which is characterized by very large data volumes and distributed access to remote data and computation (Grid computing).

The Virtual Observatory (VO) addresses part of this problem, but VO is mainly about middleware and how widely distributed components talk to each other and interoperate. VO is less concerned with how scientific computation is actually performed, assuming that most computational and data analysis software will come from the astronomical community. Our main focus here is on this computational software which the astronomical community is expected to provide, which we would like to integrate well with the VO. The software required ranges from a desktop data analysis environment which can serve as a portal to the VO, to a computationally intensive server-side element such as a data access service or pipeline. A common architecture for all such software is desirable to allow software developed by the astronomer in the familiar local desktop environment to be dynamically deployed to a remote server, to enhance scalability and make it feasible to move the computation to the data.

The system concept we explore here is at the core a processing engine designed for scientific computation. Such computation can be complex and is often compute intensive, and may involve the execution of many components. The system we describe is tightly integrated and scalable, targeting primarily desktop and cluster processing within a local administrative domain. Grid computing is supported at a higher level by using the processing engine to build services which are linked together via the VO/Grid infrastructure.

To develop the necessary software architecture it helps to look first, and more carefully, at the problem domain. What types of users are we writing this software for? What types of problems do we expect the software to be used to solve?

User Perspectives

Various types of people would use the software we describe here (see also UserScenarios):

  • Astronomer. The astronomer just wants to get their research done with a minimum of fuss. They want a ready-to-use system which is easy to use and which either includes all the needed functionality or integrates well with standard tools. The functionality needed ranges from data reduction to data analysis, and includes tools for data visualization, data browsing, plotting, etc. Data management is important as modern data sets can be complex and large. Operations may combine both user data and data from public archives, and data and computation may be either local or remote.

  • Developer. In this context, by developer we refer primarily to someone who develops science (computational or algorithmic) software, e.g., for data processing or data analysis. Such software is often complex, may be highly structured, and is focused entirely on complex processing or analysis of data. For reasons of longevity and maximal re-use, science software wants to be as system and technology independent as possible. The science software developer wants to focus on science data processing functionality, with as little concern as possible for the details of the system framework or runtime environment within which the software will execute. It should be possible for a science software developer to work effectively without having to learn much about complex execution frameworks or system infrastructure.

  • System Integrator. The system integrator is anyone who builds an end-to-end, integrated system to give to users, or for use as a facility system at some site. The system integrator needs to decide what will be in their system and what the resultant system will look like to a user. While the system integrator wants to be in charge, they do not want to have to develop all the software from scratch. The system integrator wants re-usable software which can be flexibly integrated to build a wide variety of systems. The system integrator needs to control software versions, needs software available in source form to be able to diagnose and fix problems, and needs to be able to integrate and redistribute any software used. Examples of the types of systems a system integrator might produce include a desktop system for end-user data analysis, a data reduction pipeline for an instrument or survey, or an archive front-end for interfacing an archive to the VO.

An analogous case of system integration is the Linux operating system. Multiple variants of Linux can be produced from the same underlying framework (the Linux kernel) and "components" (various open source packages and libraries). Aside from being higher level, our case differs in that the system integrations may address problem domains which differ to a greater extent than different Linux variants do. Examples of these different problem domains, or use-cases, are given in the next section.

Design Reference Use-Cases

We would like to have a common software architecture for at least the following application areas:

  • Desktop Data Processing and Analysis. The user at their workstation interactively processes and interacts with science data to perform their personal research. The data involved may be user data, workgroup data, public data from some archive, or any combination of the above. The same facilities are used regardless of whether data or computation is local or remote. It must be possible to exploit local resources (e.g., a workstation or departmental cluster) for processing, but access to remote resources is desired as well. The toolset available is controlled locally by the user. The user may write their own software for custom data processing or analysis. Most commonly this will be at the level of a script or workflow executing within the provided desktop environment, but capabilities for user development of computational components should be provided as well.

  • VO Data Access and Analysis. In this case we have a service implemented as a front-end to some archive, with high bandwidth access to the stored data and with access to adequate computational resources local to the data. The archive in question may be a specialized archive for some specific data collection, e.g., an observatory or survey archive, or a large scale data warehouse on the Grid. We would like to implement the standard VO data access services as well as provide the ability to execute user processing scripts with high-bandwidth access to the data. Even the standard VO data access services may require significant computation (e.g., on-the-fly (OTF) generation of images or spectra, simulated observations, or computation of virtual data conformant to a standard data model from archival data), so this use-case may require significant scientific computation beyond what a mere Web service interface would provide. If we add the capability to execute user-defined computation on the server (part of the Desktop use-case outlined above) then a common architecture for both server and desktop is also required.

  • Pipeline Processing. Generation of virtual data products in response to a VO data access request is very similar to conventional pipeline processing. In both cases a sequence of mostly automated processing steps is performed to generate the desired data product from more fundamental data. The main differences are in how data processing is driven and in the type of processing performed. In the case of data access, processing is driven by a client trying to access a virtual data product via a Web service; in the case of an automated pipeline, processing is driven by the dataflow from an instrument or survey. While the processing involved is very similar, the system integration required may be quite different: a facility pipeline for an instrument may look quite different at the top level than a VO service, and may involve completely different software at the top level. But at the level of the actual science data processing required the two systems may be very similar. (Not everyone agrees that pipeline processing should be a primary use-case, but we include it here nonetheless, as some of us are required to export pipelines to users and not just run them in a one-off fashion in the back room.)

All of these use-cases involve complex, data and computationally intensive scientific data processing, where a number of processing steps are performed, each of which may require application of a complex algorithm or process. While the user interface may vary greatly, in most cases the processing required is much the same regardless of the context, or how much data is involved, or whether the data is local or remote. Hence issues such as scalability or location transparency should be factored out, dealt with as part of the execution framework which controls how the computation is actually carried out. In all cases the processing required is similar, hence a common solution is indicated, allowing us to share software for all of these use-cases. Aside from software sharing, a common solution is desirable as we need to allow observers to manually process observational data from the desktop, and we would like to provide the flexibility to dynamically deploy software components in any of these three contexts.

Core Architecture

If we analyze the high level requirements presented above we find that a different approach is required for science software than for system software. Science software is computational in nature, often complex in the processing required, but with a highly abstracted interface which largely isolates it from the details of the external environment in which it functions. System software, e.g., for the execution framework, for data management, or for user and other external interfaces, is where we address systems problems like scalability and distributed execution, and is where we most need to capitalize on modern technology and the wealth of software available from outside astronomy. Below a certain level all scientific data processing is similar in nature, with adaptation to the different use-cases occurring mostly in the highest level software.

Component-Framework Architecture

This analysis, plus related requirements for things like scalability, multi-language support, support for legacy software, and so forth, leads us to a distributed component-framework architecture. In this approach, most astronomical software is cast into the form of re-usable components which can be deployed in various ways. Components execute within a container which defines the life cycle and run time environment seen by a component. The container is in turn controlled by an execution framework of some sort. At the highest level an applications layer brings it all together and is used to steer things. A presentation layer may also be required to present the functionality provided by the system to the outside world.

  • Presentation Layer. Presents the functionality of the system to whatever outside agent drives the system, be it a human user, a telescope dataflow, a Grid workflow, a Web browser interface, or whatever. In implementation terms the presentation layer might be a CLI, a GUI written in Java or C++, a Web service interface written in Java or .NET, and so forth. In principle the same system functionality can be made available in all such contexts.

  • Applications Layer. The applications layer is used to implement top level applications. In implementation terms the applications layer can be anything which can talk to the execution framework to execute components. For example, a scripting language such as Python, a high level programming language such as Java, a GUI written in Java or C++, a workflow engine for a pipeline, etc. The applications layer is high level "glue" code, with all of the heavy processing taking place in components.

  • Execution Framework. The execution framework (often referred to merely as "the framework") defines the distributed virtual machine seen by the applications layer, and provides the functionality needed to allow applications to execute components. The execution framework provides capabilities such as component registration and management, distributed execution, scalability, messaging, logging, and so forth. A range of execution frameworks of varying degrees of capability are possible. For a scalable system for use on a desktop or high performance cluster we need a fully distributed framework capable of managing the execution of components running on a number of compute nodes simultaneously.

  • Components. A component is a computational object, with one or more service methods, which can be plugged into the framework. Components are often grouped into component packages of related components. Components provide all the real functionality of the system. Components can be written in any major language and can be either new code or can be produced by wrapping legacy code. Components written in different languages can be mixed together in the same system at runtime. For scientific use the highest priority is to support components written in compiled languages such as C, C++, and FORTRAN; however, components may also be written in other languages such as Java, or may be produced by putting a component wrapper in front of scripting language code such as Python.

[Figure: cfw-arch.gif, component-framework architecture]

The component-framework architecture outlined here is an open architecture. What this means is that, while we have an overall system architecture in mind, the major elements of the system can be used separately, and can be integrated into other system architectures. For example, most components can be used stand-alone and can be integrated into any externally-defined system. The execution framework is a separate product which can be integrated into any system. Any technology can be used for the applications and presentation layers. Components are interchangeable (in terms of interface if not semantics), as are execution frameworks. Any technology can be used to build system elements such as components or the system framework. While we baseline Python as the de facto scripting language, there is nothing to prevent other scripting languages from being used as well. A major advantage of an open architecture is that the major elements of the system can evolve independently, making it easier to use new technology as it becomes available.

The component-framework architecture addresses the overall system structure and functioning, but says little about what goes on within a component. In general there is no guarantee that components from different sources will be interoperable. While this is an important problem it is one which is best addressed separately, e.g., by defining standard data models for the data objects operated upon by components. The execution framework itself, being system software, is neutral about what goes on inside a component.

Component-Container

Components execute within a container which defines the life cycle and runtime environment seen by the component. The container defines how a component is invoked and the interface to the external world seen by the component. In principle neither the component, nor the developer writing it, needs to know anything more about the execution framework, the applications and presentation layers, or the environment within which the component will be used, than what is defined by the abstract component-container interface. This is important not just to make component development easier, but to ensure that components are re-usable in various contexts.

  • In the simplest case the component is a pre-existing host program to be invoked with arguments on the command line, and the container is a process which runs the "component" as a managed subprocess. This approach has the advantage of allowing much existing software to be run without modification with only an adapter, but only a limited component-container interface is possible.

  • More generally, the container is a library which is linked with the object code of one or more components to produce an executable process. Linking can be either at compile time (the usual case for most science code) or at run time via a dynamically loadable library (useful for smaller plug-in extensions). The container manages the interface of the component to the execution environment.

  • Encapsulation of the component interface in the container allows multiple framework protocols to be supported. At the simplest level a single component can be invoked by running the container at the host level, with arguments on the command line. For fully distributed execution the container can be run from the execution framework via a protocol such as CORBA or SOAP which supports concurrent execution of components. In this case the full range of container services are available to the component for runtime messaging, communication with other components, passing parameters in both directions, and so forth.
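
The simplest case in the list above, a container that runs an existing host program as a managed subprocess, might be sketched as follows. This is only an illustration under stated assumptions: the function name `run_host_component` and the shape of the returned dictionary are hypothetical, and the Python interpreter stands in for a legacy program.

```python
import subprocess
import sys
from typing import Dict, List


def run_host_component(argv: List[str], timeout: float = 30.0) -> Dict[str, object]:
    """Run a pre-existing host program as a managed subprocess.

    The "container" here is just this function (a hypothetical name): it
    starts the program with command line arguments, waits for it to finish,
    and reports the exit status and captured output to the caller.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {
        "status": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


# Stand-in for a legacy program: the Python interpreter echoing its argument.
result = run_host_component(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "image.fits"]
)
```

As the text notes, only a limited component-container interface is possible this way: the caller sees arguments, exit status, and output streams, but none of the richer container services.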

The container provides standard container services which are available to any component. These include two-way parameter handling, a distributed environment mechanism, remote method invocation, and asynchronous messaging. At the most basic level, messaging allows messages (arbitrary blocks of information, often tagged by a message class and related metadata) to be efficiently exchanged between components during execution. Messaging may be either point-to-point between two components, or broadcast, in which case a component produces messages which are broadcast to zero or more message consumers. Clients can dynamically register with a messaging service to indicate the classes of messages they wish to receive. Messaging includes support for message streams, including high performance point-to-point data pipes; a logging service, used by components to broadcast time-tagged log messages; and property change events, used to broadcast changes to the state of a component to subscribing clients, e.g., a GUI display.
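
Broadcast messaging with registration by message class could look roughly like the following in-process sketch. The `MessageBus` class and its `subscribe` and `broadcast` methods are assumptions made for illustration; a real messaging service would be distributed and asynchronous.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[str, Any], None]


class MessageBus:
    """In-process stand-in for the messaging service (hypothetical API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, msg_class: str, handler: Handler) -> None:
        """Register a consumer for one class of messages."""
        self._subscribers[msg_class].append(handler)

    def broadcast(self, msg_class: str, payload: Any) -> int:
        """Deliver a message to every registered consumer (possibly none).

        Returns the number of consumers that received the message.
        """
        handlers = list(self._subscribers.get(msg_class, []))
        for handler in handlers:
            handler(msg_class, payload)
        return len(handlers)


bus = MessageBus()
log_lines: List[str] = []
bus.subscribe("log", lambda msg_class, text: log_lines.append(text))

delivered = bus.broadcast("log", "flat-field applied")
ignored = bus.broadcast("property-change", {"progress": 0.5})  # no consumers yet
```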

In the most general case a component is a distributed object (DO) implemented as a class with methods. Some of these methods are standard methods defined by the container and used to implement the standard container interface and life cycle. Other methods are custom service methods defined by the component to expose whatever custom functionality the component provides.
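
A minimal sketch of this split between standard and custom methods is given below. All names (`Component`, `Container`, `ImageStats`, `initialize`, `shutdown`, `invoke`) are hypothetical; the actual standard methods and life cycle would be fixed by the container definition.

```python
from typing import Any, List


class Component:
    """Base class carrying the standard, container-defined methods."""

    active = False

    def initialize(self) -> None:
        self.active = True

    def shutdown(self) -> None:
        self.active = False


class ImageStats(Component):
    """Example component exposing one custom service method."""

    def mean(self, pixels: List[float]) -> float:
        return sum(pixels) / len(pixels)


class Container:
    """Drives the component life cycle and dispatches service calls."""

    def __init__(self, component: Component) -> None:
        self.component = component

    def invoke(self, method: str, **params: Any) -> Any:
        # Standard life cycle: initialize, run the service method, shut down.
        self.component.initialize()
        try:
            return getattr(self.component, method)(**params)
        finally:
            self.component.shutdown()


container = Container(ImageStats())
value = container.invoke("mean", pixels=[1.0, 2.0, 3.0])
```

The component itself contains no framework code: everything it knows about its environment comes through the base class, which is the point of the component-container interface.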

Tasks and Parameters

An important type of component is a task component, which implements a single service method. Task components use a parameter mechanism to control the task. Unlike DOs, which can be long running and stateful, tasks are generally stateless, with all state maintained in the client, in the framework, or in persistent external storage of some sort. Tasks have the advantage of having a relatively simple interface and simple runtime semantics, increasing deployment flexibility and aiding scalability. DOs provide finer grain functionality at the cost of a more complex interface and more complex runtime semantics.

Parameters are organized into named parameter sets. Every task has an associated parameter set, used to pass data operands and control parameters to the task. Parameter sets can also exist independently of tasks as named data entities. Some parameters must be given values in order to execute a task. Other parameters have default values and are "hidden"; hidden parameters need only be specified if it is necessary to override the default value. Parameter sets can also be used to return data from a task to the caller (e.g., to a script) or to pass modest amounts of data between tasks.

A parameter set consists of a set of parameter objects. Each parameter object has a number of attributes such as a parameter name, type, default value, min and max or enumerated value, query mode, prompt string, help reference, and optional unit and error values. Both primitive (string, int, float, bool, etc.) and abstract (e.g., "file", "cursor", etc.) parameter types are permitted. While a parameter set defines a simple flat namespace with no hierarchical structure, a parameter set may reference one or more other parameter sets, allowing simple hierarchical relationships to be defined. While a full description is required to define a parameter set, all that is required at runtime to invoke a task is a simple keyword table or dictionary consisting of parameter name and value pairs.
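
The relationship between the full parameter set description and the simple runtime dictionary might be sketched as follows. The class and attribute names are assumptions, only a few of the attributes listed above are modeled, and value coercion is deliberately naive (boolean and abstract types such as "file" would need more care).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

_REQUIRED = object()  # sentinel: no default, so the caller must supply a value


@dataclass
class Parameter:
    name: str
    ptype: Callable[[Any], Any]  # primitive type, used to coerce values
    default: Any = _REQUIRED
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    prompt: str = ""

    @property
    def hidden(self) -> bool:
        """Hidden parameters carry a default and need not be specified."""
        return self.default is not _REQUIRED


class ParameterSet:
    """A named, flat collection of parameter objects."""

    def __init__(self, name: str, params: List[Parameter]) -> None:
        self.name = name
        self.params = {p.name: p for p in params}

    def resolve(self, values: Dict[str, Any]) -> Dict[str, Any]:
        """Merge a runtime name/value dictionary with the defaults."""
        unknown = set(values) - set(self.params)
        if unknown:
            raise KeyError(f"unknown parameters: {sorted(unknown)}")
        resolved: Dict[str, Any] = {}
        for name, p in self.params.items():
            if name in values:
                value = p.ptype(values[name])
            elif p.hidden:
                value = p.default
            else:
                raise ValueError(f"missing required parameter: {name}")
            if p.minimum is not None and value < p.minimum:
                raise ValueError(f"{name} is below the minimum {p.minimum}")
            if p.maximum is not None and value > p.maximum:
                raise ValueError(f"{name} is above the maximum {p.maximum}")
            resolved[name] = value
        return resolved


pset = ParameterSet("imstat", [
    Parameter("image", str, prompt="Input image"),
    Parameter("nsigma", float, default=3.0, minimum=0.0),
])
resolved = pset.resolve({"image": "m51.fits"})
```

Here `image` must be supplied while `nsigma` is hidden, so invoking the task needs only the one-entry dictionary shown.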

Execution Framework

The execution framework connects all the other elements of the system, allowing client applications to execute components, and allowing components to communicate with each other. At the core the framework is a software bus which provides a standard way for components to connect ("plug in") and communicate via messaging. Major elements of the system framework include the following:

  • Distributed Virtual Machine. More than just a software bus, the framework implements a distributed virtual machine (DVM) abstraction. The DVM manages the execution of components on multiple compute nodes as if they were a single machine, providing transparent scalability for cluster computing. Compute nodes may be dynamically added to or removed from the DVM during execution. On a cluster the application script would normally run on the head (login) node, with all the computation taking place in components running on compute nodes. A parallel file system of some sort would provide high performance shared access to storage for all compute nodes. On a workstation everything runs on the head node. Efficient execution on a single machine is a priority for desktop use. An optional software console provides facilities for booting and controlling the DVM. An optional logging service provides instrumentation for monitoring the execution of components.

  • Package Manager. The package manager service is responsible for managing all information related to component packages and parameter sets. When a component package is "plugged into" the framework, the package metadata is ingested by the package manager, defining all components and making them immediately available for execution via the DVM. In addition to registering components and maintaining their metadata, the package manager provides persistent storage and distributed access methods for parameter sets.

  • Software Bus. The software bus provides basic facilities for components and containers to plug into the framework, for execution of components, for exchanging messages between components, and so forth. Further information on messaging is given in the container discussion above.

To interface an applications layer component such as Python or Java to the framework, all that is necessary is to bind the framework services into the target language environment.

In implementation terms the framework can be anything which implements the logical model and services defined by the framework definition. This could be anything from a thin layer in an applications language such as Python (with the Python session directly executing components), to a fully distributed framework implemented using some technology such as CORBA or SOAP, possibly combined with a package such as MPI for high performance messaging.
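
At the thin end of that range, a framework that lives entirely inside a Python session could be sketched as below. The `ThinFramework` class, its `plug_in` and `execute` methods, and the `package.task` naming scheme are hypothetical; the sketch covers only component registration and direct execution, with no distribution or messaging.

```python
from typing import Any, Callable, Dict, List

Task = Callable[..., Any]


class ThinFramework:
    """Single-process framework: the Python session executes components directly."""

    def __init__(self) -> None:
        self._registry: Dict[str, Task] = {}

    def plug_in(self, package: str, tasks: Dict[str, Task]) -> None:
        """Register every task of a component package (package manager role)."""
        for name, task in tasks.items():
            self._registry[f"{package}.{name}"] = task

    def components(self) -> List[str]:
        """List the registered components by qualified name."""
        return sorted(self._registry)

    def execute(self, qualified_name: str, **params: Any) -> Any:
        """Run a registered task with a name/value parameter dictionary."""
        try:
            task = self._registry[qualified_name]
        except KeyError:
            raise LookupError(f"no such component: {qualified_name}") from None
        return task(**params)


def add(a: float, b: float) -> float:
    """Toy computational task standing in for real science code."""
    return a + b


framework = ThinFramework()
framework.plug_in("images", {"add": add})
total = framework.execute("images.add", a=1.5, b=2.0)
```

A fully distributed framework would keep the same logical interface while replacing the direct function call with remote execution.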

Scalability

Scalability is provided by the DVM, by the execution framework and messaging system, and by the underlying cluster hardware and software, e.g., a parallel file system. In the simplest (default) case, everything executes on a single node such as a workstation or laptop. In the case of a cluster, interactive components and application scripts execute on the head node, computational components execute on a variable number of compute nodes, and one or more storage nodes provide shared access to a parallel file system, with a fast switch connecting all nodes. In a Grid scenario, the system described here is itself a compute or storage element in some externally managed Grid workflow (see the discussion of VO/Grid integration below).

  • Basic scalability is provided by the distributed execution capabilities of the DVM in combination with a parallel file system. This provides a basic data-parallel capability, with multiple compute tasks executing independently on separate datasets (examples of parallel file systems include, in no particular order, PVFS/PVFS2, GFS, IBM GPFS, Lustre, Ibrix, Mosix, etc.).

  • Parallel algorithm support requires use of an existing technology such as MPI. A mechanism is needed to tell the DVM to execute N copies of a component (distributed over the compute nodes) which will function as a parallel unit. A parallel task appears as a single unit of computation to the applications layer.

  • Transparent support for data-parallel computation (the same operation applied independently to N separate datasets) requires some support at the applications layer level, e.g., in Python. The same simple procedural SIMD script is used regardless of the number of compute nodes available, with the framework automatically scaling the computation to utilize the available resources. All that is required to implement this is some sort of RUN construct in the scripting language which tells the framework to apply the same operation (task) on N datasets in parallel.

  • Parallel operations on extremely large datasets may require concurrent operations on portions of a large dataset. There are various ways to address this problem. One promising approach is to use a shared data access object (DAO) to mediate access to the object by multiple concurrently executing compute nodes. The object itself is stored in numerous smaller files at the level of the storage system (this is desirable in any case for an extremely large dataset), with the DAO managing access to the overall logical data object. The actual file i/o is parallel at the level of the individual object elements, with each compute node having a parallel (or nearly so) data path to the file element via the parallel file system. Alternatives would be to handle the i/o directly in the DAO, or to use a combination of DAO i/o and file i/o.

  • Concurrent database access is something which has been provided for years by relational databases. The architecture is similar to a DAO, with the database server mediating concurrent access to a database or table. Modern database systems take this one step further and have the ability to distribute large queries over the nodes of a small cluster. Further gains are possible via segmentation of a database at the logical database design level.
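
The RUN construct mentioned in the list above, which applies the same task to N datasets in parallel, might look something like this at the scripting level. The function name `run` and its `nodes` argument are assumptions, and a local thread pool stands in for the compute nodes that the DVM would actually manage.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


def run(task: Callable[[Any], Any], datasets: Iterable[Any], nodes: int = 4) -> List[Any]:
    """Apply the same task to each dataset; results come back in input order.

    The script is the same whatever the value of `nodes`; only the degree
    of concurrency changes.
    """
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        return list(pool.map(task, datasets))


def total_flux(pixels: List[float]) -> float:
    """Toy task: sum the pixel values of one dataset."""
    return sum(pixels)


datasets = [[1.0, 2.0], [3.0, 4.0], [5.0]]
results = run(total_flux, datasets, nodes=2)
```

The calling script stays procedural and does not change when more compute resources become available, which is the transparency the text asks for.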

The challenge in providing scalability in our case is managing the complexity and size of the system. While we want to achieve transparent scalability, it is essential that the resultant system be "lean and mean" enough to be usable on a modest desktop system. This means that the system must be no more difficult to use on a desktop system than a nonscalable system, and the system size and startup time must be comparable to a typical desktop system.

User Interfaces

User interfaces are part of the presentation layer. User interfaces are not required in all circumstances, e.g., a server which exports functionality via Web services need not implement a user interface. An interactive desktop system will however require user interfaces such as a command language interface (CLI), various graphical user interfaces (GUIs) or visualization components such as an image display or data browser. User interfaces can also be provided using Web browser technology or via a scripting language such as Python.

[Figure: gui-arch.gif, user interface architecture]

CLI. The purpose of a CLI is to provide a streamlined environment for command entry. A good CLI will provide a simple syntax for command entry, e.g., without the need to parenthesize argument lists or quote strings. Parameter support should be provided, including parameter defaulting, prompting for missing parameters, parameter editing, and so forth. Shell-like facilities for command completion, i/o redirection, history, aliasing, and so forth are often provided. Simple control flow constructs may be provided in the language so long as doing so does not conflict with the efficient use of the CLI for command entry. Complex scripting is better done in a real scripting language such as Python or Perl.

While there are many ways one could implement a CLI, one approach which would work well given Python as the scripting language would be to implement the CLI as a Python module. Since the framework interface would be to Python itself, the CLI module would see only a Python API hence would be largely framework independent. While the CLI could be implemented as a separate component, putting it in the same address space as the Python interpreter would provide better integration with the scripting language and would allow an expert user to alternate between the CLI and interactive Python within the same session.
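
As a rough illustration of a CLI implemented as a Python module, the sketch below turns a command line with no parentheses and minimal quoting into a task name, positional arguments, and keyword parameters. The `name=value` convention and the function name `parse_command` are assumptions, not part of any defined CLI syntax.

```python
import shlex
from typing import Dict, List, Tuple


def parse_command(line: str) -> Tuple[str, List[str], Dict[str, str]]:
    """Split a command line into (task, positional args, keyword params).

    Tokens of the form name=value become keyword parameters; everything
    else after the task name is treated as a positional argument.
    """
    tokens = shlex.split(line)
    if not tokens:
        raise ValueError("empty command")
    task = tokens[0]
    positional: List[str] = []
    keywords: Dict[str, str] = {}
    for token in tokens[1:]:
        name, sep, value = token.partition("=")
        if sep and name.isidentifier():
            keywords[name] = value
        else:
            positional.append(token)
    return task, positional, keywords


# Quotes are only needed when a value contains whitespace.
parsed = parse_command("imstat m51.fits nsigma=2.5 title='M51 core'")
```

The parsed result can then be handed to whatever Python API the framework binding exposes, which keeps the CLI module largely framework independent.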

GUIs. A number of alternatives are possible for implementing GUIs. Implementing GUIs directly in the scripting language environment is one possibility. This can be a good way to provide a well integrated environment for user software development, or to provide dedicated GUIs for a CLI (such as a parameter editor), but if carried too far starts to overload the scripting language component. A more scalable approach is to provide a GUI server component which executes a downloadable (applet-like) GUI script, and which uses messaging to communicate with other components via the framework. The GUI listens for property change events transmitted from remote computational components via the framework, updating the GUI display as events occur, and sending requests to the remote component in response to user commands entered via the GUI. For applications where messaging performance is an issue, a dynamically loadable plug-in component can be used to move computational components into the address space of the GUI. The GUI server approach works well for adding optional GUIs to computational components, as well as for implementing large visualization components such as an image browser.

Legacy Software Integration

The principal strategy for adapting legacy software to this new architecture is to wrap legacy software as components. Most components would be computational components (science software). Legacy software could also be adapted for use in the presentation layer, for things like the CLI, data visualization, image display, and plotting. At the simplest level, integrating legacy components is merely a matter of writing an adapter. To fully integrate such software it might be necessary to modify the portion of the legacy software which implements the existing tasking, parameter management, etc., functionality to use the new container technology. One might also add some code to the component to use new facilities such as runtime messaging and logging. Studies of existing systems indicate that this is quite feasible. Since the container can support multiple runtime protocols it might be possible to provide backward compatibility to support legacy frameworks and user interfaces.

The most notable problem with integrating legacy software into a common architecture is not system software, but incompatible data models and data formats. While legacy software can function within a common framework without addressing this problem, interoperability of code from different systems will be limited.

Data Models

Much of science data processing and analysis concerns operations upon complex data objects such as flux and astrometrically calibrated sky projection images, spectral data cubes, multiband imagery, synoptic imagery, multiobject spectra, 1D spectra, spectrophotometric time series data, high energy event data, interferometric or single dish visibility data, pulsar data, GRB data, synthetic (model-generated) data, simulated data, and so forth. Data may be fully calibrated or may be raw instrumental data including instrument-specific metadata. Datasets may be produced by the combination of instrumental datasets, e.g., SED data, or dithered and stacked mosaic imagery.

To deal with such data in any general way requires a formal specification of the data model for the data. For astronomical data we call this the science data model (SDM). All data processing and analysis is defined in terms of the SDM for the data being operated upon. To store and transport data we must also define a data representation in some concrete external storage format such as FITS or XML. This mapping of the SDM to a specific data representation is called an export data format (EDF) for a particular class of data.

To process large, complex datasets it is generally necessary to go one step further and implement a class library for a particular type of data. This requires a choice of language (C, C++, Java, etc.) and often implies a dependence on lower level software, e.g., class code for a FITS container or an XML parser. Issues such as efficiency and scalability (for very large datasets) are generally dealt with in the class library. Ideally the implementation of such a class library is such that it can be re-used by multiple data processing "systems" (class libraries used to build components). In the general case, a data access object (DAO) may consist of an externally defined, formal data model, a storage representation, a class library, a DAO component which can interface to an execution framework, and an API which can be called by science code in a computational component. Given all this, multiple groups can then proceed to build interoperable data processing and analysis software for a given class of data.
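
The distinction between a data model and its export format can be made concrete with a toy example. The sketch below assumes a deliberately tiny Spectrum1D model and an invented XML layout; neither corresponds to any real VO or observatory standard. The class plays the role of the SDM, and to_xml/from_xml play the role of one possible EDF.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Spectrum1D:
    """Toy science data model: the object that science code operates on."""
    target: str
    wavelength_unit: str
    flux_unit: str
    wavelength: List[float]
    flux: List[float]


def to_xml(spec: Spectrum1D) -> str:
    """Map the model to an (invented) XML export format."""
    root = ET.Element("spectrum", target=spec.target)
    for tag, unit, values in (
        ("wavelength", spec.wavelength_unit, spec.wavelength),
        ("flux", spec.flux_unit, spec.flux),
    ):
        axis = ET.SubElement(root, tag, unit=unit)
        axis.text = " ".join(repr(v) for v in values)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Spectrum1D:
    """Rebuild the model from the export format."""
    root = ET.fromstring(text)
    wave = root.find("wavelength")
    flux = root.find("flux")
    return Spectrum1D(
        target=root.get("target"),
        wavelength_unit=wave.get("unit"),
        flux_unit=flux.get("unit"),
        wavelength=[float(v) for v in (wave.text or "").split()],
        flux=[float(v) for v in (flux.text or "").split()],
    )


spec = Spectrum1D("HD 1", "nm", "Jy", [500.0, 500.5], [1.25, 1.5])
restored = from_xml(to_xml(spec))
```

Science code written against Spectrum1D does not change if the export format is later replaced, for example by a FITS mapping; only the two conversion functions would.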

Data models fall into two broad classes: data models for calibrated data, and data models for raw instrumental data. In general, VO is responsible for defining standard data models for standard classes of calibrated data, as well as standards for observational metadata and physical data characterization which are generally applicable to all astronomical data. The observatories which produce instrumental data are necessarily responsible for defining both the SDM and EDF for data from a specific instrument or survey project. Both the SDM and the EDF are external interfaces, and must be formally defined as such in order to allow reliable processing of data products by the general astronomical community.

VO/Grid Interface

As noted in the introduction above, VO is mainly about middleware and how widely distributed components talk to each other and interoperate. In the VO context, data and computational resources are exposed as services which are called by remote Web clients or as nodes within a Grid workflow. For the most part VO is not concerned with how computationally intensive services or client data analysis programs are implemented, but that is precisely our concern here.

VO Service. In this case we have some computation which we wish to expose to the VO as a Web service. Examples might be computation of a virtual data product for a data access service, or an analysis function of some sort. The basic strategy is to implement the Web interface with conventional Web technology (e.g., Java/Tomcat/AXIS), and use the data analysis system to handle the complex scientific computation required to generate the virtual data product to be returned by the service. For the purposes of our discussion here, let's assume we are using Java for the Web interface; other technology such as .NET could be used as well given the necessary interfaces.

  • The Web service interface is the presentation layer in our architecture, responsible for the interface of the processing engine to the caller (some VO client).

  • The Web service functionality and related processing such as database access, authentication, etc. could be handled directly in the Java code which is well suited for this type of processing.

[Figure: vo-integ.gif]

  • Java is interfaced directly to the execution framework hence it is straightforward to drive the processing engine from a Web service front end.

  • Computationally expensive, long running services are possible using the asynchronous services technology currently being developed for the VO.

  • Computation of science data would be performed by a conventional application, e.g., a Python script or Java program, using the framework to execute computational components. The same application might be available in other contexts as well, e.g., from an interactive CLI.

  • Aside from the Web service interface and parts of the top level application, all of the code involved is the same code we would use for other applications such as desktop data processing and analysis or a pipeline.

In general the components used in a data analysis system are too fine grained to be usefully exposed to the VO as Web services. Rather we usually need to write a new application which is designed to be used in the VO/Grid context, and expose this as a Web service. The application itself is written in the application layer (e.g., in Python or Java), using other applications plus the framework and conventional components to do the processing. Such applications may be complex and may involve extensive processing using many individual components.
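
The layering described above (fine-grained components, a coarser application built from them, and a thin service front end) might look roughly like the following. This is a sketch under stated assumptions: the Framework class, the cutout and rescale components, and the dictionary-based request format are all hypothetical, and the front end is shown in Python for brevity, where the text assumes Java.

```python
from typing import Callable, Dict, List


class Framework:
    """Hypothetical execution framework: runs named components."""

    def __init__(self) -> None:
        self._components: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, func: Callable[..., object]) -> None:
        self._components[name] = func

    def execute(self, name: str, **params: object) -> object:
        return self._components[name](**params)


# Fine-grained components: too small to be useful as Web services.
def cutout(pixels: List[float], start: int, stop: int) -> List[float]:
    return pixels[start:stop]


def rescale(pixels: List[float], factor: float) -> List[float]:
    return [p * factor for p in pixels]


def virtual_product(fw: Framework, pixels: List[float],
                    start: int, stop: int, factor: float) -> List[float]:
    """Application layer: chains components into one service-sized operation."""
    section = fw.execute("cutout", pixels=pixels, start=start, stop=stop)
    return fw.execute("rescale", pixels=section, factor=factor)


def handle_request(fw: Framework, request: Dict[str, str]) -> Dict[str, object]:
    """Presentation layer: decode the request, delegate, encode the reply."""
    pixels = [float(v) for v in request["pixels"].split(",")]
    result = virtual_product(fw, pixels, int(request["start"]),
                             int(request["stop"]), float(request["factor"]))
    return {"status": "ok", "pixels": result}


fw = Framework()
fw.register("cutout", cutout)
fw.register("rescale", rescale)
response = handle_request(
    fw, {"pixels": "1,2,3,4", "start": "1", "stop": "3", "factor": "2"})
```

Only handle_request knows about the request format; virtual_product could equally be called from an interactive CLI or a pipeline.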

Almost any computation could be exposed as a Web service in this way. Since the framework is scalable this would provide a way to execute arbitrary bits of computation remotely for large scale problems, e.g., to move the computation to the data. Given two copies of the system, one local and one remote, with the same component packages installed in both locations, it would be possible to dynamically deploy an application script interfaced as a Web service to run remotely. The script could be developed and tested locally and then deployed remotely, possibly on a much larger system.

VO Client. In this case we have an application which wishes to make use of VO resources as part of some computation. Probably this is some sort of data analysis application which needs access to remote data or possibly computation.

  • Once again: science software, for various reasons such as maximal re-use, does not want to know about external details such as frameworks and protocols, so we want to hide the details of how to talk to the VO. A high level interface of some sort is required to hide details such as use of a Web services interface.

  • Some VO functionality can be made directly available to the data analysis system by interfacing the corresponding Web service to the framework as a component (task or DO). Such functionality would be available to any client application using the same interface provided for locally executing components. (This same mechanism would be used to execute application scripts remotely, as mentioned above.)

  • More complex services, such as the client side of the VO data access layer (DAL), require some client-side functionality beyond just what is required to interface to a Web service. This client-side layer should hide details such as query handling, authentication, asynchronous services, and caching of data fetched from a remote VO service.

For example, when a science application does a query, it usually does not want to get a VOTable back, rather the application would like some client-side code to submit the query, parse the response, and provide a high level API to the application to iterate through the query response line by line (much as a client would query a conventional database). Probably this would be implemented by a service running locally, which knows how to talk to remote VO services, but which can also talk to local data analysis software via the execution framework.
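
A minimal sketch of the row-iteration idea is shown below, using only the Python standard library. It assumes a response containing a single TABLE in the TABLEDATA serialization, returns every cell as a string, and ignores datatypes, namespaces beyond stripping them, and error handling; iter_rows is a hypothetical helper and the sample response is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List


def _local(tag: str) -> str:
    """Strip an XML namespace prefix of the form '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_rows(votable_xml: str) -> Iterator[Dict[str, str]]:
    """Yield each table row as a dict keyed by FIELD name."""
    root = ET.fromstring(votable_xml)
    names: List[str] = []
    # root.iter() walks in document order, so FIELDs are seen before rows.
    for elem in root.iter():
        tag = _local(elem.tag)
        if tag == "FIELD":
            names.append(elem.get("name"))
        elif tag == "TR":
            cells = [(td.text or "").strip()
                     for td in elem if _local(td.tag) == "TD"]
            yield dict(zip(names, cells))


# Made-up query response, for illustration only.
SAMPLE = """<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE>
      <FIELD name="ra" datatype="double"/>
      <FIELD name="dec" datatype="double"/>
      <FIELD name="title" datatype="char" arraysize="*"/>
      <DATA><TABLEDATA>
        <TR><TD>10.68</TD><TD>41.27</TD><TD>M31 field</TD></TR>
        <TR><TD>83.82</TD><TD>-5.39</TD><TD>M42 field</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

rows = list(iter_rows(SAMPLE))
```

The science application sees only dictionaries, one per row, much as it would with a database cursor; the VOTable never reaches it.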

When accessing remote data, the local VO client-side service would not only fetch the data but cache it locally. All access to the data for data analysis would then be to the cached local copy of the data. In combination with capabilities for asynchronous data staging it would be possible for a remote data access service to deliver the data directly to the client-side cache without ever storing it on the server. By providing Web access to cached data the same mechanism could be used to expose data back to the VO for consumption by remote clients, for example to return data products from a local data access or analysis service. Within the local data analysis system, equivalent access would be provided for local and remote data.
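
The fetch-once, read-locally behaviour can be sketched as below. DataCache is a hypothetical class, the fetch function is injected so that the example needs no network access, and the URL and file contents are placeholders. A production cache would also need expiry, locking, and atomic writes, none of which are shown.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


class DataCache:
    """Hypothetical client-side cache: remote URL in, local file path out."""

    def __init__(self, root: Path, fetch: Callable[[str], bytes]) -> None:
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)
        self._fetch = fetch

    def path_for(self, url: str) -> Path:
        # Hash the URL so that any URL maps to a safe, fixed-length file name.
        return self._root / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str) -> Path:
        path = self.path_for(url)
        if not path.exists():
            # Cache miss: fetch once and store the local copy.
            path.write_bytes(self._fetch(url))
        return path


calls: List[str] = []


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real download; records each call."""
    calls.append(url)
    return b"placeholder image bytes"


cache = DataCache(Path(tempfile.mkdtemp()), fake_fetch)
first = cache.get("http://example.org/image.fits")
second = cache.get("http://example.org/image.fits")
```

Analysis code is handed a local path in both cases, which is what makes access to local and remote data equivalent from its point of view.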

The client-side data caching functionality described here is analogous in some respects to capabilities such as "VOSpace" in VO, since both facilities have the ability to cache data products as part of a VO workflow. The chief difference is in the client interface for data analysis and the need for transparent integration into a data analysis system. The VO side of the interface should adhere to VO standards, e.g., by implementing a future VOCache API and by using relevant standards for Grid and Web services.

Summary

The software architecture explored here is intended primarily to provide a processing engine for complex, data and computationally intensive scientific computation. A high level application, usually implemented as a script, drives computation via a distributed set of computational components controlled by a framework. The software has an open architecture, allowing a great variety of systems to be produced from the same underlying components.

Now that we have a software architecture in mind, we can go back and reconsider our various users and see how well a system based on the proposed architecture would serve their needs:

  • Astronomer. Processing of user or workgroup data is integrated with data analysis of user and public archival data. Various user-oriented tools, such as a CLI, advanced scripting environment, GUIs, and user-configurable software mix, are provided, executing efficiently in the desktop environment. Data and computation can be local or remote. The biggest challenge is probably managing complexity and keeping the resultant software sufficiently well integrated, lightweight, and "astronomer friendly" (particularly for user software development) to be successful in the desktop environment.

  • Developer. The science software developer works primarily at two levels: as a component developer, or as an application developer. Components can be developed and tested at the host level with minimal knowledge required of the system framework. Application scripting in an environment such as Python is straightforward, attractive, and familiar to most developers. The framework should be largely transparent, with most of the developer's attention focused on components.

  • System Integrator. The system integrator can use as much or as little of the standard software as they need to deliver the system they produce. The software mix and look and feel of the overall system are under the control of the system integrator. System integrators can use only components, or an execution framework plus components, or entire applications, customizing the presentation layer as needed to produce the desired system integration and functionality.

The system described can function either stand-alone, on a desktop or workgroup cluster, or as part of the VO. The basic strategy for integration with the VO is twofold: as a data analysis client which serves as a portal to the VO, or as a computational node which exports services to the VO.

-- DougTody - 6 Dec 2004


Discussion


Initial version. -- DougTody - 28 Nov 2004


From the face-to-face meeting, 2/3 Dec 2004:

Following discussion, there was good agreement on the draft architecture, and the following block diagram was drawn in order to clarify a number of points.

    ----------------     --------------
   |       GUI      |   |     CLI      |
    ----------------     --------------
    /|\        |                |
     |         |                |
     |      --------------------------------
     |     |  Controller Task/Application   |
     |     --------------------------------
     |      /|\                       |
     |       |                        |
     |       |   --------------------------------------------------
     |       |   | DVM     |             |Package    |Persistence |
     |       |   |----------             |Manager    |------------|
     |       |   |                       |           |Discovery   |
     |       |   |           FRAMEWORK   |-----------|------------|
     |       |   |                                   | Logging    |
     |       |   |------------------------------------------------|
     |       |                        |
     |       |                        |
     |       |  ---------------------------------------------------
     |       |  |                                    |Logging     |
     |       |  |          CONTAINER                 |------------|
     |       |  |                                    |Lifecycle   |
     |       |  |                                    |------------|--------------------
     |       |  |                                    |Messaging   | IPC/WebServ/CORBA |
     |       |  |                                    |------------|--------------------
     |       |  |                                    |2-way parameter handling |
     |       |  |---------------------------------------------------------------
    \|/     \|/   |               |                   |
    --------------       ----------------       ----------------------
    |  UNIT TASK  |      | VO Client    |      |   Controller Task   |
    --------------       ----------------       ---------------------

The next step will be to elucidate the Parameter Interface, probably an XML encoding with an associated Schema. This will probably be a minimal interface, but enough to allow a prototype involving interoperability of legacy applications from Starlink, IRAF, MIDAS and AIPS++.

-- DavidGiaretta - 03 Dec 2004


Updated to reflect discussions from the Dec 2-3 Garching meeting. Added material in the following areas:

  • Container services (messaging, logging, etc.)
  • A bit more on parameter sets
  • Discussion of scalability issues
  • Discussion of user interface architecture
  • Discussion of data model issues

-- DougTody - 6 Dec 2004


Added a paragraph to the discussion of system integration illustrating how Linux is an analogous example of the open system architecture presented here. Added a comparison of VOSpace to the client-side VO interface discussed here. Minor tweaks elsewhere.

-- DougTody - 2 Jan 2005




Attachment: container.pdf (193.3 K, 10 Feb 2005 - 23:52, DougTody): Component Container Architecture draft 0.3


Copyright © 1999-2009 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.