The role of big data in genetic testing: Infrastructure, data lakes, and ethical management of genomic information
Synopsis
The massive funding invested in genomics has made it a field of data science fundamentally allied with computer science. One consequence has been the development of an infrastructure of open-access public repositories such as the Sequence Read Archive, the European Nucleotide Archive, and the DNA Data Bank of Japan. Such repositories are essential to the advance of the field, efficiently storing and sharing genomic information while ensuring privacy protection and data confidentiality. However, it is also necessary to anticipate potential ethical risks under the newest multidisciplinary paradigm in genomics. Firstly, security hazards are expected, such as distributed denial-of-service (DDoS) attacks and 'doxing', the malicious public release of private information. Other genetic information management efforts must be directed toward the strict regulation of the treatment and storage of genetic data and toward preventive endpoint measures. In this last respect, it is of utmost importance that genetic data never appear directly in final outputs without anonymization, but rather be accessed through intermediate automation or analysis layers. Nonetheless, the desirability of these data will likely drive the development of further protective layers, such as AI-assisted, irretrievable watermarking applied at the data storage level, with direct blocking mechanisms installed in dedicated hardware components of networks or cloud computing servers.
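To make the idea of a preventive endpoint measure concrete, the following is a minimal sketch in Python of an access layer in which raw genotypes never leave the service: callers obtain only aggregate, anonymized statistics, and releases below a minimum cohort size are refused. All names (GenotypeStore, allele_frequency, MIN_COHORT_SIZE) and the threshold value are illustrative assumptions, not part of any real repository API.

    # Sketch of a preventive endpoint measure: raw per-individual genotypes
    # are held internally and never returned; only aggregate, anonymized
    # statistics leave the service. Names and threshold are illustrative.
    from dataclasses import dataclass

    MIN_COHORT_SIZE = 50  # assumed floor below which aggregates risk re-identification

    @dataclass
    class GenotypeStore:
        # genotypes[sample_id][snp_id] -> alternate-allele count (0, 1, or 2)
        genotypes: dict

        def allele_frequency(self, snp_id: str) -> float:
            """Release only the cohort-level alternate-allele frequency for one SNP."""
            counts = [g[snp_id] for g in self.genotypes.values() if snp_id in g]
            if len(counts) < MIN_COHORT_SIZE:
                raise PermissionError("cohort too small; aggregate withheld")
            return sum(counts) / (2 * len(counts))  # diploid: two alleles per sample

        # Deliberately, no method exposes a single individual's genotype.

The design choice here is that anonymization is enforced at the interface rather than trusted to downstream users: every output exposed to a caller is a derived statistic, never the genetic data itself.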
Each cohort member is genotyped at a set of Single Nucleotide Polymorphisms (SNPs) included on a microarray-based chip. Around one million SNPs per individual are genotyped, giving rise to an impressive volume of data to be analyzed. These SNPs are spread across the genome, so genetic data management also involves linking these variants to biological features such as genes or pathways. In addition, genotyping many individuals at a time imposes a heavy computational burden in terms of data management and manipulation. Therefore, the analysis of such data typically runs on infrastructures hosted by High-Performance Computing (HPC) centers. Since most researchers have no direct access to these facilities, user-friendly gateways are in high demand. Most data and results are kept in public repositories. Data management issues range from internationally accepted formats for sharing GWAS datasets to the design of better infrastructures or data lakes able to hold and relate the large number of entities in use. A wide range of services is offered to the academic community: analysis-ready datasets, tools for data manipulation, lists of known associations, and the posting of unpublished research results. A specific kind of stored and shared data is the genetic sequence itself. Care must be taken to ensure privacy protection and data confidentiality, which makes this kind of data less openly available.
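As a rough illustration of why these volumes push analyses onto HPC infrastructure, the following back-of-the-envelope sketch in Python estimates the size of a genotype matrix at one million SNPs per individual, under a naive one-byte encoding and under the compact two-bits-per-call packing used by formats such as PLINK's .bed files. The cohort size of 500,000 is an assumed, biobank-scale figure, not taken from the text.

    # Back-of-the-envelope estimate of raw genotype-matrix size. The cohort
    # size below is an assumed biobank-scale figure, not from the text.
    N_SNPS = 1_000_000       # ~one million genotyped SNPs per individual
    N_SAMPLES = 500_000      # assumed number of cohort members

    # Naive encoding: one byte per genotype call (0/1/2 alternate-allele count).
    naive_bytes = N_SNPS * N_SAMPLES
    print(f"1 byte per call : {naive_bytes / 1e12:.2f} TB")   # 0.50 TB

    # Packed encoding: 2 bits per biallelic call (as in PLINK .bed files),
    # i.e. four genotype calls packed into each byte.
    packed_bytes = N_SNPS * N_SAMPLES // 4
    print(f"2 bits per call : {packed_bytes / 1e9:.0f} GB")   # 125 GB

Even the compact encoding leaves over a hundred gigabytes to scan per analysis pass, which is why such workloads run at HPC centers and why data lakes must relate genotype matrices to external entities such as genes, pathways, and phenotypes without duplicating them.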