Paul Doyle was born in Dublin, Ireland. He received his BSc. and MSc. from Dublin City University in 1990 and 1992 respectively. From 1993 to 2003, he worked at Sun Microsystems, where, as a senior manager, he was responsible for the development of Thin Client and Blade Server technology. He has also worked as a senior product manager at CR2, a global provider of self-service banking software solutions. He is currently a senior lecturer in the School of Computing at DIT Kevin Street, where he carries out research, in collaboration with ITTD and CIT, on astronomical data processing using cloud computing.
Cleaning Images in the Cloud.
In astronomical computing it is currently possible to generate terabytes of charge-coupled device (CCD) image data on a daily basis, and this capability is shared by institutes both large and small. Every CCD image must undergo a pre-processing step of calibration and cleaning before magnitude values can be generated for stars or other light sources, and before the image can ultimately be used in the construction of light curves for analysis. As the volume of data increases, so does this pre-processing requirement. Existing data processing pipelines are either primarily sequential in nature, and thus fail to exploit the parallel nature of the captured data, or rely on high performance computing solutions close to the dataset. As datasets grow to terabytes per day, sequential processing approaches create a bottleneck prior to the creation and analysis of photometric light curves, and require ever larger and more complex data centre solutions.
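As an illustration of the kind of per-image calibration step described above, the following is a minimal sketch of the conventional CCD reduction formula, (raw - bias - dark) / normalised flat. It is an assumption that the pipeline applies exactly this formula; the function name `calibrate` and the use of plain nested lists instead of FITS files are hypothetical choices made to keep the example self-contained.

```python
from statistics import mean


def calibrate(raw, bias, dark, flat):
    """Apply the conventional CCD calibration, pixel by pixel.

    All four arguments are equally sized 2-D lists of numbers.
    Assumptions (not taken from the source): the dark frame is already
    bias-subtracted and scaled to the raw exposure time, and the flat
    field is already bias-subtracted. A zero-valued flat pixel would
    raise ZeroDivisionError; real pipelines usually mask such pixels.
    """
    # Normalise the flat field so that its mean value is 1.0.
    flat_mean = mean(value for row in flat for value in row)
    if flat_mean == 0:
        raise ValueError("flat field has zero mean")

    return [
        [(r - b - d) / (f / flat_mean)
         for r, b, d, f in zip(raw_row, bias_row, dark_row, flat_row)]
        for raw_row, bias_row, dark_row, flat_row
        in zip(raw, bias, dark, flat)
    ]


if __name__ == "__main__":
    raw = [[110, 210], [310, 410]]
    bias = [[100, 100], [100, 100]]
    dark = [[10, 10], [10, 10]]
    flat = [[1.0, 1.0], [1.0, 1.0]]
    print(calibrate(raw, bias, dark, flat))
```

Because each image is calibrated independently of every other image, a function of this shape can be applied to many images at once, which is the property a parallel pipeline exploits.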
This research is focused on the calibration and reduction phase of astronomical pipelines, up to the point of creating magnitude values but prior to the production or analysis of light curves. Light curve generation and analysis are considered beyond the scope of this research. Using a reference dataset of 26 GB from the Blackrock Castle Observatory (BCO) in Cork, a data processing pipeline is proposed which incorporates the characteristics of distributed computing and cloud computing, such as elasticity, parallel processing, and the utilisation of commodity computing resources. This pipeline framework will demonstrate how a decentralised elastic computing module can be created to process terabytes of image data per day.
This research has already led to the creation of a distributed pipeline spanning three institutes, demonstrating a 98% reduction in processing time over an existing BCO processing pipeline. Further performance enhancements are sought to demonstrate the feasibility of a parallel distributed cleaning pipeline for datasets on the order of tens of terabytes per day. Research is ongoing through a series of over 300 sizing and performance experiments using a mix of the Amazon Web Services (AWS) infrastructure, the HEAnet storage infrastructure and a private cloud spanning multiple institutes of technology in Ireland. Central to this research is the use of EC2 instances operating as worker nodes within the pipeline, accessing Nginx web servers that serve static image files hosted on AWS EBS and the HEAnet iSCSI storage farm. The pipeline is controlled by a series of Python scripts which launch instances from preconfigured AMIs; these instances obtain work via the SQS service. The hypothesis under test is that 100 TB of raw astronomical data can be processed by a distributed processing pipeline in less than 24 hours. Such a system could be relevant to the data processing challenges facing the Large Synoptic Survey Telescope, due to come online within the next few years.
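To put the 24-hour target in perspective, the short calculation below works out the sustained aggregate throughput that it implies, and how many worker nodes would be needed at a given per-worker rate. This is a back-of-the-envelope sketch only: the function names are hypothetical, 1 TB is taken as 10^12 bytes, and the 50 MB/s per-worker figure is an illustrative assumption, not a measured result from this research.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
TB = 10**12                     # decimal terabyte, an assumption


def required_throughput(total_bytes, seconds=SECONDS_PER_DAY):
    """Sustained aggregate rate (bytes/s) needed to finish in `seconds`."""
    return total_bytes / seconds


def workers_needed(total_bytes, per_worker_bytes_per_s,
                   seconds=SECONDS_PER_DAY):
    """Smallest whole number of workers that meets the deadline,
    assuming perfect scaling and no coordination overhead."""
    rate = required_throughput(total_bytes, seconds)
    return math.ceil(rate / per_worker_bytes_per_s)


if __name__ == "__main__":
    rate = required_throughput(100 * TB)
    print(f"Aggregate rate: {rate / 1e9:.2f} GB/s")  # about 1.16 GB/s

    # Hypothetical per-worker rate of 50 MB/s (download plus processing).
    print("Workers needed:", workers_needed(100 * TB, 50e6))  # 24
```

The perfect-scaling assumption is optimistic; in practice storage bandwidth, queue latency and instance start-up time would all raise the number of workers required.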
SPIE Paper 2012: Astronomical Data Processing in the Cloud
Research Associates: Blackrock Castle Observatory, Cork; ITTD, Tallaght