Cloud computing and scalable artificial intelligence architectures for genomic analytics: Platforms, pipelines, and security
Synopsis
Cloud computing has transformed how scientific research and analytics are carried out in many fields, including bioinformatics and the areas of biology and genetics that depend on it, such as genomic analytics. Rapid advances in sequencing technologies have reduced the cost of genome sequencing by two orders of magnitude compared with 10 years ago, making it increasingly affordable for even small research groups to collect very large datasets. Cloud-based genomic solutions allow non-expert researchers to analyze their biological “wet” lab data safely and efficiently against massive, distributed, and complex “dry” lab big data, using scalable analysis platforms, services, and tools accessed by “clicking” in a web browser, even on smart devices.
Traditional approaches to data analysis seriously constrain what can realistically be pursued within the time limits of a single research project, or even over several educational and grant cycles. The cost of purchasing the necessary software and hardware infrastructure is well out of reach for small research groups and under-funded academic institutions. Even impressive HPC resources may not be enough to tackle complex questions and perform sophisticated analyses that rival state-of-the-art research, yet they may be under-utilized for the routine daily needs of biologists, who are in most cases the end-users of genomics data. In some cases, bioethical and administrative constraints make it impossible to send data over the Internet, or the required services cannot be found on the market. In less well-resourced regions, the prohibitive cost and difficulty of recruiting capable bioinformaticians create a bottleneck in exploiting these datasets and widen the gap in biomedical research relative to major research centers in more developed countries.
Spark, a major technology used in this work and one of the most widely used computing systems for big data processing, can deliver speed-ups of up to 100 times over traditional disk-based systems. Thirty-six new software tools and more than 36,000 genomic databases were released in the third year. Pre-existing analytics tools will be adapted and improved to work within the framework of the Genomic Open-data Architecture (GOA) on the PaNlab integrated hardware-software stack. Finally, a high-level Object Oriented Cloud Interface (OCI) will be produced, enabling any user to leverage all available services and tools securely and efficiently in their own “cloud agnostic” environment.