To-Do List
- Security. Setting aside the question of security on the configuration manager
(as that is a separate system), we need to ensure that messages sent to the various
ORCM elements (daemons and main program), plus those sent to/from applications, are
properly authenticated and protected. Otherwise, anyone could inject a multicast
message into the system.
- Complete implementation of the leader selection/failover code.
- Create a truly reliable multicast. The current multicast implementation is strictly
fire-and-forget. Many reliable multicast methods have been researched and published
over time. We are currently working on integration with the Spread library, but
additional implementations are welcome.
- Application collective operations (subset of MPI, running across the ORTE/pnp transports).
These will be needed to support some database operations in a scalable manner,
and perhaps some leader selection/failover methods.
- New point-to-point transports (unicast UDP, shared memory).
- Shared memory at the ORTE layer for data sharing. The daemons will be sharing a significant
amount of information with their local applications. At the moment, this can only be done by having
the daemon send messages to each application. This consumes time and memory, as each application has
to store its own copy of the data. The OMPI team has talked for quite some time about the desirability
of having ORTE shared memory support for common operational data; this needs to be done to support
OMPI operations too.
- Large message support for multicast. The ORTE multicast system currently does not support
messages larger than one MTU. Support for larger messages is desirable and probably required in
the long term. This will require that messages be fragmented, and that the required message matching
logic be written.
- As part of the above, investigate moving the multicast framework underneath the current ORTE
RML framework. The RML is intended to operate as a multi-selection framework, so it would be possible
to have the multicast module return not-supported for p2p messages so they could be sent via the OOB
module. This would simplify the ORCM pnp code, which currently must select between multicast and p2p
routes.
- Performance enhancement. The ORTE comm channels were not intended for performance-based comm;
they were just used for admin functions where message time wasn't that important. Since we are using
them for performance-sensitive messages, we need to clean these up.
- Interface to a configuration manager (Tail-f, Moab, or whatever). Receive notifications when the
configuration has been changed and adjust the application configuration accordingly by starting or
stopping apps as required.
- ORTE thread safety for asynchronous communications. Currently, ORTE will only progress messages
(including receiving messages) when the application calls down into the ORTE library. This creates a
performance issue, and raises concerns about receiving data when we are not in the library. This can
be corrected, but needs some debugging effort.
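
As a rough illustration of the fragmentation and message matching called for in the
large message item above, here is a minimal sketch in Python. It is not ORTE code:
the header layout, the fragment() function, and the Reassembler class are all
hypothetical names chosen for this example, and it assumes fragments may arrive out
of order or duplicated.

```python
import struct

# Hypothetical header layout (an assumption, not ORTE's wire format):
# message id (uint32), fragment index (uint16), fragment count (uint16).
HEADER = struct.Struct("!IHH")


def fragment(msg_id, payload, mtu):
    """Split payload into packets of at most mtu bytes, header included."""
    chunk = mtu - HEADER.size
    if chunk <= 0:
        raise ValueError("mtu too small for the fragment header")
    # An empty payload still yields one (empty) fragment.
    pieces = [payload[i:i + chunk] for i in range(0, len(payload), chunk)] or [b""]
    if len(pieces) > 0xFFFF:
        raise ValueError("payload needs more fragments than the header can count")
    return [HEADER.pack(msg_id, idx, len(pieces)) + piece
            for idx, piece in enumerate(pieces)]


class Reassembler:
    """Matches incoming fragments to their message and rebuilds it."""

    def __init__(self):
        self._partial = {}  # (sender, msg_id) -> {fragment index: bytes}

    def add(self, sender, packet):
        """Store one packet; return the full payload once complete, else None."""
        msg_id, idx, count = HEADER.unpack_from(packet)
        if idx >= count:
            raise ValueError("fragment index out of range")
        key = (sender, msg_id)
        parts = self._partial.setdefault(key, {})
        parts[idx] = packet[HEADER.size:]  # a duplicate simply overwrites
        if len(parts) < count:
            return None
        del self._partial[key]
        return b"".join(parts[i] for i in range(count))
```

Because the current multicast is fire-and-forget, a lost fragment would leave a
partial entry in the table indefinitely. A real implementation would presumably need
a timeout or a retransmission request on top of this matching logic.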