News
2012 News & Highlights
- Our adaptive runtime system has demonstrated significant
benefit to a broad class of High Performance Computing (HPC) applications.
April 2012: A full-machine Jaguar test of a new Charm++ implementation provided impressive
performance gains over an earlier version. The new version features a network
layer implementation designed for Cray's Gemini interconnect. Performance for a
100-million-atom NAMD run (PME every 4 steps) improved from 26 milliseconds per step
with last year's MPI-over-SeaStar+ configuration to 13 milliseconds per step
with the new software and hardware.
2011 Highlights
- November 2011: The Colony kernel has demonstrated significant
benefit to a broad class of High Performance Computing (HPC) applications.
Using coordinated scheduling techniques, the Colony kernel has
demonstrated a nearly three-fold improvement in synchronizing collectives
at scales of both 10,000 and 30,000 cores. These results indicate that Linux
is a suitable operating system for this new scheduling scheme, and that this design
provides a dramatic improvement in the performance of synchronizing collectives
at scale. Details of the technical achievement, including experimental results for
a Linux implementation on a Cray XT5 machine, are described in the paper
"Linux Kernel Co-Scheduling and Bulk Synchronous Parallelism," to appear in an upcoming issue
of the International Journal of High Performance Computing Applications (IJHPCA).
- July 2011: Colony has received an ERCAP allocation from NERSC.
Colony will use the Hopper machine to develop new ports of Charm++ for HPC environments.
- May 2011: Our new adaptive task mapping strategies show improvements for the Weather
Research and Forecasting (WRF) model. On 1,024 nodes, the average hops per byte was reduced by
63% and the communication time was reduced by 11%.
- May 2011: We developed a new causal message-logging scheme with improved performance and
scalability. We also completed the design and implementation of a new dynamic load-balancing
technique. Results for the BRAMS weather forecasting model show much higher machine utilization
and a reduction of more than 30% in execution time.
- April 2011: The Colony co-scheduling kernel produced better-than-expected benchmark results.
Performance improved nearly 3x for allreduces on MPI_COMM_WORLD for applications running on
2,220 cores on ORNL's Cray XT5 (Jaguar). These results were published at ROSS 2011.
- January 2011: The Colony team reached a milestone this month by booting a new operating
system kernel. By successfully bringing up the advanced kernel on a Cray XT with a
SeaStar interconnect, the team paves the way for the next phase of performance and
scalability testing. Unlike a typical Linux kernel, which suffers performance drawbacks at scale,
the new kernel is designed to provide a full-featured environment with excellent scalability
on the world’s most capable machines. Just as coordinated stoplights improve traffic
flow, the Colony system software stack co-schedules parallel jobs and thus removes
the harmful effects of operating system noise (interference) through an innovative
kernel scheduler (a simplified illustration of the idea appears below). The kernel relies on a
high-precision clock synchronization algorithm developed by the Colony team to provide federated
nodes with a sufficiently accurate global time source for the required coordination. Papers
describing the new kernel are in preparation.
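To make the co-scheduling idea concrete, here is a minimal user-space sketch, not the Colony kernel scheduler, of how nodes sharing a sufficiently precise global clock can independently map the current time onto the same schedule and confine background activity to a common window. The period, window length, and the global_time_us() stand-in are assumptions for illustration only.

```cpp
// Minimal sketch of coordinated (co-)scheduling driven by a shared time base.
// This is NOT the Colony kernel scheduler; it only illustrates the idea that,
// given a sufficiently precise global clock, every node can independently map
// "now" onto the same schedule and confine background work to a common window.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Assumption: a globally synchronized time source in microseconds. Here we
// stand in std::chrono::system_clock; the real system would use the
// high-precision synchronized clock mentioned above.
static std::uint64_t global_time_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
        system_clock::now().time_since_epoch()).count();
}

// One scheduling period: most of it for the parallel application, a small
// trailing window reserved on every node for OS/daemon ("noise") activity.
constexpr std::uint64_t kPeriodUs = 10'000;   // 10 ms period (assumed)
constexpr std::uint64_t kNoiseWindowUs = 500; // last 0.5 ms of each period

bool in_background_window(std::uint64_t now_us) {
    return (now_us % kPeriodUs) >= (kPeriodUs - kNoiseWindowUs);
}

int main() {
    for (int i = 0; i < 5; ++i) {
        std::uint64_t now = global_time_us();
        if (in_background_window(now)) {
            std::cout << "t=" << now << " us: run daemons/bookkeeping\n";
        } else {
            std::cout << "t=" << now << " us: run application work\n";
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(3));
    }
}
```

Because every node evaluates the same schedule against (nearly) the same clock, the "noise" windows line up across the machine instead of interrupting different nodes at different times.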
2010 Highlights
- December 2010: The SpiderCast communications work is proceeding to
implementation. A design document is
available here.
- October 2010: The Colony team has successfully tested a new high-precision
clock synchronization strategy at scale on ORNL's Jaguar computer system.
The new algorithm is designed for large leadership-class machines
like Jaguar. Unlike most high-precision algorithms, which reach their stated precision only in a
post-mortem analysis after the application has completed, the new ORNL-developed algorithm rapidly
provides precise results during runtime. Prior to our work, the leading high-precision
clock synchronization algorithms that made results available during runtime relied on probabilistic
schemes that are not guaranteed to produce an answer. Our results are described in a
paper presented at the 22nd IEEE International Symposium on Computer Architecture and
High Performance Computing in October. (A generic sketch of runtime offset estimation appears after this year's highlights.)
- Summer 2010: Experiments with our latest software show improved message logging
(a 512-processor job saw a 73% reduction in message-log volume). We have developed
a new synchronized clock scheme which exhibits much better performance than
previous distributed protocols. An initial design of our SpiderCast communications
service will be released this summer. We have developed a new DHT (distributed
hash table) service (see Tock10b). The benefits of our topology-aware load
balancing were demonstrated with OpenAtom.
- January 2010: More Great News! The Colony Project was selected to receive
a supercomputing allocation through the Innovative and Novel Computational
Impact on Theory and Experiment (INCITE) program. The INCITE program
promotes cutting-edge research that can only be conducted with state-of-the-art
supercomputers. The Leadership Computing Facilities (LCFs) at Argonne and
Oak Ridge national laboratories, supported by the U.S. Department of Energy
Office of Science, operate the program. The LCFs award sizeable allocations
on powerful supercomputers to researchers from academia, government, and
industry addressing grand challenges in science and engineering such as
developing new energy solutions and gaining a better understanding of climate
change resulting from energy use.
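For context on what runtime clock-offset estimation looks like, the sketch below simulates the classic symmetric round-trip (NTP-style) estimate. It is a generic textbook illustration under simulated, roughly symmetric network delays, not the ORNL algorithm described in the October 2010 item above.

```cpp
// Generic illustration of runtime clock-offset estimation (NTP-style),
// simulated locally. This is not the Colony/ORNL algorithm; it is the classic
// round-trip estimate that such algorithms are typically compared against.
#include <iostream>
#include <random>

int main() {
    std::mt19937 rng(42);
    // Simulated one-way network delay in microseconds (roughly symmetric).
    std::uniform_real_distribution<double> delay(20.0, 30.0);

    const double true_offset = 1234.5;  // remote clock minus local clock (us)

    // One request/response exchange:
    //   t0: local send, t1: remote receive, t2: remote send, t3: local receive
    double t0 = 0.0;                             // local clock at send
    double t1 = t0 + delay(rng) + true_offset;   // timestamped by remote clock
    double t2 = t1 + 5.0;                        // remote processing time
    double t3 = (t2 - true_offset) + delay(rng); // back on the local clock

    // Classic symmetric estimate: offset ~= ((t1 - t0) + (t2 - t3)) / 2.
    // Its error is half the asymmetry between the two one-way delays.
    double estimated_offset = ((t1 - t0) + (t2 - t3)) / 2.0;

    std::cout << "true offset (us):      " << true_offset << "\n";
    std::cout << "estimated offset (us): " << estimated_offset << "\n";
    std::cout << "error (us):            "
              << estimated_offset - true_offset << "\n";
}
```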
2009 Highlights
- November 2009: There will be a Birds-of-a-Feather (BOF) meeting for FastOS
projects during the annual Supercomputing Conference in Portland, Oregon.
The BOF, which will be held Wednesday, November 18, 2009, at 5:30, will include
brief presentations from many of the projects funded by the FastOS program
(see this
link for more details).
- September 2009: Colony II is officially underway! The three research teams
(ORNL, UIUC, and IBM) have received their funding and we are
now able to start the next phase of our research. (Funding was delayed to
accommodate our PI, Terry Jones, who is joining ORNL.) Colony II will be
funded for three years to study adaptive system software approaches to
issues associated with load imbalances, faults, and extreme-scale systems.
2008 Highlights
- Our Project Principal Investigator, Terry Jones, will be joining Oak Ridge
National Laboratory as a Staff R&D Member in the Computer Science
and Mathematics organization. In the last few years, Oak Ridge has
dramatically increased its supercomputing facilities. Among the
production systems at ORNL are a 4,096-core Blue
Gene/P machine and a 250-Tflop Cray, and much larger machines are
currently being installed. Terry will be working with a team of
system software researchers who have brought about such innovations
as Parallel Virtual Machine (PVM) and HPC OSCAR.
- March 2008: Great News!
The Office of Advanced Scientific Computing Research's Computer Science program
recently announced that it is awarding new funds to the Colony Project to continue its
collaborative work on improving high performance computing (HPC) system software stacks.
Today's system software stacks, including operating systems and runtime systems,
unnecessarily limit performance or portability (or in some cases, both). Strategies
developed by the Colony Project address a wide range of system software problems such
as operating system interference (noise) while introducing important adaptive
capabilities that free workloads from performance-reducing load imbalances.
The Colony Project is a collaborative effort that includes Oak Ridge National Laboratory,
the IBM T.J. Watson Research Center, and the University of Illinois at Urbana-Champaign.
Colony began its research effort in 2005 and has received its major funding through the
DOE Office of Science Advanced Scientific Computing Research (ASCR) program (ASCR link
here, ASCR's computer
science projects link here).
- The Colony project received computer time as part of the
2008 BGW Day.
We performed a number of experiments to evaluate our latest coordinated scheduling
kernel, including parameter-space studies. A report describing our tests and results is
available here.
2007 Highlights
- Scaling results from experiments conducted by the Colony team on July 26, 2007, on their big-pages kernel at the Sixth
BGW Day are now available in this report. Additional results in
the areas of resource management and fault tolerance are also
available from experiments we conducted during the Fourth BGW Day. These experiments were performed on a
20,000+ core system at IBM's T. J. Watson facility.
- Compute-node Linux was demonstrated running a NAS parallel benchmark, a Charm++ application, and other programs.
- We assessed operating system evolution on the basis of several key factors related to system call functionality.
These results were the basis for a paper presenting system call usage trends for Linux and Linux-like
lightweight kernels. Comparisons were made with several other operating systems employed in high performance
computing environments, including AIX, HP-UX, OpenSolaris, and FreeBSD.
- We completed and demonstrated a prototype of our fault tolerance scheme based on message logging [Chakravorty07],
showing that distributing the objects that resided on a failed processor across the remaining processors can
significantly improve the recovery time after the failure.
- Our proactive fault-tolerance scheme was integrated into the standard Charm++ distribution and is now available to
any Charm++/AMPI user.
- We extended the set of load balancers available in Charm++ by integrating recently developed balancers
based on machine topology. These balancers use metrics based on the volume of communication and the number of hops
as factors in their balancing decisions (a simplified sketch of this cost metric appears below).
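As a rough illustration of such a metric, the sketch below computes "hop-bytes" (bytes exchanged weighted by network hops) and the resulting average hops per byte for a hypothetical object placement on a small 2-D torus. The topology, sizes, and names are assumptions for illustration; this is not the Charm++ balancer code itself.

```cpp
// Sketch of a topology-aware communication cost ("hop-bytes"): for each pair
// of communicating objects, bytes exchanged are weighted by the number of
// network hops between the processors they are mapped to. An 8x8 torus and
// the traffic figures are assumed purely for illustration.
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

struct TorusCoord { int x, y; };

constexpr int kDimX = 8, kDimY = 8;  // assumed 8x8 torus of processors

// Shortest hop distance along one torus dimension (wrap-around allowed).
int torus_dist(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

int hops(const TorusCoord& a, const TorusCoord& b) {
    return torus_dist(a.x, b.x, kDimX) + torus_dist(a.y, b.y, kDimY);
}

struct Message { int src_obj, dst_obj; long bytes; };

int main() {
    // Hypothetical mapping: object i lives on processor placement[i].
    std::vector<TorusCoord> placement = {{0, 0}, {0, 1}, {4, 4}, {7, 7}};
    std::vector<Message> traffic = {
        {0, 1, 1'000'000},  // neighbors: cheap
        {0, 2, 1'000'000},  // far apart: expensive
        {2, 3, 500'000},
    };

    long long hop_bytes = 0, total_bytes = 0;
    for (const auto& m : traffic) {
        hop_bytes += static_cast<long long>(m.bytes) *
                     hops(placement[m.src_obj], placement[m.dst_obj]);
        total_bytes += m.bytes;
    }
    // Average hops per byte is the figure of merit a mapping tries to reduce.
    std::cout << "hop-bytes: " << hop_bytes << "\n";
    std::cout << "average hops per byte: "
              << static_cast<double>(hop_bytes) / total_bytes << "\n";
}
```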
2006 Highlights
- Our first prototype Linux solution for Blue Gene compute nodes is operational.
- We completed a detailed study of the difference in performance observed when running the same application
using either Linux or the lightweight Compute Node Kernel (CNK) on the Blue Gene compute nodes.
Included in the assessment was a study of the impact of operating system noise on Blue Gene performance.
- We assessed the effectiveness of our in-memory checkpointing by performing tests on a large Blue Gene/L
machine. In these tests, we used a 7-point stencil with 3-D domain decomposition, written in MPI.
Our results are quite promising up to 20,480 processors.
- Our proactive fault tolerance scheme is based on the hypothesis that some faults can be predicted.
We leverage the migration capabilities of Charm++ to evacuate objects from a processor where faults
are imminent. We assessed the performance penalty due to the incurred overheads, as well as the memory footprint
penalty, for up to 20,480 processors.
- To accomplish our goal for Global Resource Management, we have developed a new hybrid load balancing
algorithm (HybridLB) that is designed for scientific applications with persistent computation and
communication patterns. HybridLB uses a hierarchical load-balancing tree to distribute tasks
across processors. We demonstrated that this approach can effectively deal with problems encountered
by centralized approaches (e.g., contention and excessive memory footprint); a simplified sketch of the hierarchical idea follows this list.
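To convey the structure only, here is a small two-level sketch in which tasks are balanced greedily within each group of processors, and only each group's aggregate load would be reported to the next level of the tree. This is not the actual HybridLB algorithm; all names and numbers are invented for the example.

```cpp
// Simplified two-level illustration of hierarchical load balancing: tasks are
// balanced greedily within each group of processors, so no single node ever
// needs the full global task list. Illustrative structure only, not HybridLB.
#include <algorithm>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

// Greedy balancing within one group: the heaviest remaining task goes to the
// least-loaded processor of that group.
std::vector<double> balance_group(std::vector<double> tasks, int nprocs) {
    std::sort(tasks.begin(), tasks.end(), std::greater<double>());
    std::vector<double> proc_load(nprocs, 0.0);
    for (double t : tasks) {
        auto lightest = std::min_element(proc_load.begin(), proc_load.end());
        *lightest += t;
    }
    return proc_load;
}

int main() {
    // Hypothetical setup: 2 groups (subtrees) of 4 processors each, with the
    // task loads that currently reside in each group.
    std::vector<std::vector<double>> group_tasks = {
        {9, 7, 5, 3, 3, 2, 1, 1},   // group 0 is overloaded overall
        {4, 2, 1, 1},               // group 1 is underloaded
    };
    const int procs_per_group = 4;

    for (std::size_t g = 0; g < group_tasks.size(); ++g) {
        auto loads = balance_group(group_tasks[g], procs_per_group);
        double total = std::accumulate(loads.begin(), loads.end(), 0.0);
        std::cout << "group " << g << " per-processor loads:";
        for (double l : loads) std::cout << ' ' << l;
        // At the next tree level, only this aggregate would be reported, and
        // the root would shift work between groups if the totals diverge.
        std::cout << "  (group total reported upward: " << total << ")\n";
    }
}
```

Keeping the decisions within groups, and passing only summaries up the tree, is what lets this style of balancer avoid the contention and memory-footprint problems of a fully centralized strategy.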
2005 Highlights
- We studied the behavior of one particular source of asynchronous events: the TLB misses incurred
by dynamic memory management (which are absent in the production CNK). We modified CNK to support
dynamic memory management with a parameterized page size and analyzed the impact of different
strategies/page sizes on NAS kernels and Linpack.
- We measured the effectiveness of parallel-aware scheduling for mitigating operating system
interference (also referred to as OS noise or OS jitter in recent literature). Preliminary results
on the Miranda parallel instability code indicate that parallel-aware scheduling across the
machine can dramatically reduce variability in runtimes (standard deviation decreased from
108.45 seconds to 5.45 seconds) and total wallclock runtime (mean decreased from 452.52 seconds
to 254.45 seconds).
- The design of our in-memory checkpointing scheme is complete.
- We analyzed Charm++'s current resource management schemes and designed new, more scalable schemes.