When it comes to managing large volumes of data, health in general and health informatics in particular are at the top of the list. In this post I'd like to draw attention to how we can create scalable models in GNU Health using parallel and distributed computing methods.
In the old days – and even today – large areas of hospitals are dedicated exclusively to storing patient medical records: thousands of charts that add up to millions of pages.

The advent of Hospital Information Systems (HIS) and Electronic Medical Records (EMR) is transforming those paper-based records into bits and bytes. The GNU Health Hospital and Health Information System is one example.
GNU Health has many areas that involve loading, processing, searching and transforming large sets of data. Here are some examples that we work with in GNU Health daily:
- Demographics: Individual identification means, gender, addresses, occupations, domiciliary units, insurances, health professionals, institutions
- Medical records: Patient evaluations, hospitalizations, laboratory and medical imaging orders, prescriptions, medication
- Coding standards: Datasets that involve coding standards for interventions and procedures (ICPM, ICHI, …), pathology, and health conditions (ICD-10, ICD-11, …)
- Genomics: Very large datasets involving DNA sequencing, natural variants, genes, …
- Epidemiology: Statistics are key in early warning systems, outbreak detection, health promotion and disease prevention programs. Those reports can involve massive amounts of data to be processed.
I would like to stress the importance of a good parallel or distributed computing model for maximum scalability and performance. One of the main problems is that we tend to emulate our linear lives in computing. The society we live in turns our daily activities into a set of sequential, chronological (dull) tasks (wake up -> bathroom -> breakfast -> work -> […] -> dinner -> sleep) put into a loop.

Think parallel. Instead, I'd like to think in terms of how our body systems work internally: from the macroscopic organs to the minute hormones and neurons, all working simultaneously in beautiful synchrony to maintain homeostasis, the internal equilibrium that keeps us alive and well. It would be impossible for a linear, sequential loop to process the events happening in a single second of our lives. Parallel processing works the miracle. All the “workers”, “processes” and their signaling (“IPC”, interprocess communication in computer science terms) make it happen.
A real-life example: without a good design, the project will not scale. At the beginning, with a few records, our system may perform fine. With time, the database grows: where we initially had one hundred patients, all of a sudden we have reached one million. Each person and patient in that million-strong population has their own medical record, demographic history, lab tests… you get the idea. Analytic reporting, exporting or importing data will not scale if we don't have a good design.
The following is a real-life example from the migration of our community server to the latest version of GNU Health HIS: checking and syncing the values stored in the datasets residing on the filesystem (for instance, updating to the latest version of the UniProt human gene natural variants) with those in the database. In total, we had nearly 150,000 records to sync. GNU Health HIS uses Tryton, a great Free/Libre framework on top of Python and PostgreSQL. What might seem a trivial task is not. When we look closer, syncing each record involves many tasks, such as logging in, checking user permissions on the model, checking the status of the record, verifying that it was not changed after the last update, and so on. With 100 records we might afford linear processing; with a set of 150K, we must look for a parallel computing solution.
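To give a rough idea (this is a minimal sketch, not the actual migration script), such a sync can be spread over worker processes with Python's multiprocessing module; the sync_variant() stub and the record IDs below are placeholders for the real Tryton logic.

```python
# Minimal sketch: distribute ~150,000 record syncs across worker processes.
# sync_variant() is a hypothetical placeholder for the real per-record work
# (login, permission checks, comparing the filesystem dataset with the DB, update).
from multiprocessing import Pool, cpu_count

def sync_variant(record_id):
    # ... open the dataset entry, compare it with the DB record, update if needed ...
    return record_id

def main():
    record_ids = range(150_000)          # in reality, the natural variant IDs
    with Pool(processes=cpu_count()) as pool:
        # chunksize amortizes the IPC overhead over many small tasks
        for done in pool.imap_unordered(sync_variant, record_ids, chunksize=500):
            pass                         # progress reporting / logging could go here

if __name__ == '__main__':
    main()
```

Handing the workers chunks of IDs instead of single records keeps the interprocess communication small compared to the useful work each process does.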
I have experienced similar situations when migrating medical records from another system to GNU Health. The initial batch upload might contain thousands or millions of records. A good parallel model design will turn days into hours, hours into minutes and minutes into seconds.
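Along the same lines, here is a hedged sketch of a chunked batch import: rows from a legacy export are grouped into batches, so each worker pays the connection and validation overhead once per batch instead of once per record. The legacy_export.csv file and the import_batch() helper are hypothetical placeholders for the real import logic.

```python
# Sketch of a chunked batch import. 'legacy_export.csv' and import_batch()
# are hypothetical; import_batch() would create the records through the
# GNU Health / Tryton API (for instance via the proteus client library).
import csv
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

BATCH_SIZE = 1000

def import_batch(rows):
    # ... connect once, then create and validate this batch of records ...
    return len(rows)

def batches(path, size=BATCH_SIZE):
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, size))
            if not chunk:
                break
            yield chunk

def main():
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(import_batch, batches('legacy_export.csv')))
    print(f'{total} records imported')

if __name__ == '__main__':
    main()
```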

The GNU Health Federation: Distributed computing for large health networks.
The GNU Health Federation is another example of how to create scalable systems in health. In this case, instead of using multiple processes within a single computer, we set up multiple “workers”, which we call nodes, across a province, country or region. A node can be an individual using the MyGNUHealth personal health record, a laboratory or a hospital. Each of them works independently, and they communicate via the network. Data aggregation and reporting happen at the GNU Health Federation's Health Information System server, on top of a special, document-oriented PostgreSQL database.
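To picture how a node could talk to the federation, here is an illustrative sketch of pushing a demographics record over HTTP to Thalamus, the federation's REST message server. The URL, resource path, account ID and credentials are assumptions made for the example, not the exact Thalamus API.

```python
# Sketch only: a node pushing a demographics record to the federation.
# Thalamus (the GNU Health Federation message server) exposes a REST API;
# the URL, resource path, account ID and credentials below are illustrative
# assumptions, not the production endpoints.
import requests

THALAMUS_URL = 'https://federation.example.org:8443'   # hypothetical server
AUTH = ('node_user', 'node_password')                  # hypothetical credentials

def push_person(federation_id, payload):
    resp = requests.post(
        f'{THALAMUS_URL}/people/{federation_id}',       # assumed resource path
        json=payload,
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.status_code

if __name__ == '__main__':
    print(push_person('ESP0001', {'name': 'Ana', 'gender': 'f'}))
```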

Summary: Make a big problem small. Think parallel.
In the end, whether you use multiple processes on the same computer or different nodes in a health network, the concept is pretty much the same: make a big problem small. The PCAM design methodology is a great start. PCAM stands for Partition, Communicate, Agglomerate and Map: decompose the initial problem into smaller data domains and functional (computational) units, design the way they talk to each other, combine (agglomerate) the tasks, and finally map those tasks to processors.
It is also important to know your resources so you can dimension and design the solution to the problem. In the data sync example, for instance, spawning too many processes results in a degraded system: we saturate our resources and the system spends more time waiting for I/O or making the processes talk to each other than doing useful work. You may then use processes, threads or even distributed computing, different implementation methods that fit the context and your resources.
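To make the four steps concrete, here is a toy, annotated sketch (the case records and counting logic are invented for illustration) that counts cases per region in parallel:

```python
# Toy PCAM illustration: count (invented) cases per region in parallel.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# Invented dataset: 100,000 (region, count) records.
CASES = [('north', 1)] * 40_000 + [('south', 1)] * 60_000

def count_cases(chunk):
    """Computational unit applied to one group of records."""
    counts = Counter()
    for region, n in chunk:
        counts[region] += n
    return counts

def main():
    # Partition: conceptually, one small task per case record.
    # Agglomerate: group those fine-grained tasks into chunks of 10,000
    # so the per-task overhead stays negligible.
    size = 10_000
    chunks = [CASES[i:i + size] for i in range(0, len(CASES), size)]

    # Map: the executor assigns each chunk to a worker process.
    with ProcessPoolExecutor() as pool:
        partials = pool.map(count_cases, chunks)

    # Communicate: the partial counters travel back over IPC and are merged.
    total = Counter()
    for partial in partials:
        total.update(partial)
    print(total)   # Counter({'south': 60000, 'north': 40000})

if __name__ == '__main__':
    main()
```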
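As a rule-of-thumb sketch (the sizing heuristics are my assumptions, not GNU Health defaults), you can dimension the pool from the machine rather than from the dataset, and choose processes or threads depending on whether the work is CPU-bound or I/O-bound:

```python
# Sketch: dimension the pool from the machine, not from the dataset size,
# and pick processes for CPU-bound work, threads for I/O-bound work.
# The sizing heuristics are assumptions, not GNU Health defaults.
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def make_executor(io_bound: bool):
    cores = os.cpu_count() or 1
    if io_bound:
        # I/O-bound (network, disk): more threads than cores is fine,
        # but keep a cap so we do not drown in context switching.
        return ThreadPoolExecutor(max_workers=min(32, cores * 4))
    # CPU-bound: more processes than cores only adds scheduling overhead.
    return ProcessPoolExecutor(max_workers=cores)

if __name__ == '__main__':
    with make_executor(io_bound=True) as pool:
        print(list(pool.map(len, ['demographics', 'genomics', 'epidemiology'])))
```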
Conclusion: As a final thought, I'd like to put the emphasis not on computing power, but on the power of open science and solidarity as a community. Computers can definitely help us achieve our goals, but the most efficient parallel / distributed model resides in the human factor. Today we are living in an unjust world ruled by a very few yet very powerful people and corporations. Concentration of power and computational resources will only benefit a few, creating more inequality and a steeper social gradient. Humanity is reaching a new low, and we cannot normalize the killing of thousands of innocent children happening in front of our very eyes. We cannot permit our governments to prioritize the macabre business of war over the flag of human rights. The scientific community must rise up and organize for peace, social justice and equity in our society.
Open science, cooperation, solidarity and empathy are the key to solving any problem, no matter how big it may be.
Happy hacking