Gene banks data

Gene banks, as described by Wikipedia, are a type of biorepository that preserves genetic material. For plants, this can mean freezing cuttings from the plant or storing its seeds. Gene banks hold great value for our future, especially as our planet undergoes environmental changes: they let us access and use specific seeds that would otherwise be lost in nature.

The seeds themselves are valuable, but the ability to find and retrieve them is just as critical. Without a proper information access system, these seeds would be lost in gene banks the same way they would have been lost in nature. Fortunately, the top gene banks in the world do a good job of storing and managing their data through internal software systems maintained by IT units.

However, not all gene banks manage their data well. Some systems are as crude as Excel spreadsheets or Word documents, kept on regular desktop computers and copied across different machines, without any proper backup process or disaster recovery routine. Imagine losing this data: the location the seeds were collected in, the date, the characteristics of the plant. Without this data the seeds become useless, so it is very important that gene banks take responsibility for managing their data correctly.

Another important issue is that this data might not be accessible through the internet. Either the gene banks have little internet connectivity, or they simply don't have the software to publish their data online.

Accessing data across different gene banks

All this inconsistency between the information systems of different gene banks does not help people who are trying to access this data. A system must be built that gathers the data from all the different gene banks and makes it accessible in a consistent way.

This is not an easy task. It's like locating and gathering all the information on the internet about movies, and making it searchable in a consistent way. You could scrape data from different sites such as IMDb and Rotten Tomatoes, but you would still need to normalize part of the data. For example, IMDb rates movies with stars; Rotten Tomatoes uses percentages. So you either convert stars into percentages, or you're stuck with two pieces of information that are semantically the same but syntactically different.
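To make the normalization step concrete, here is a minimal Python sketch. It assumes one source reports a rating as stars out of ten and the other as a percentage; the field names ("stars", "percent", "score") and the sample records are made up for illustration and do not reflect either site's real data format.

```python
def stars_to_percent(stars: float, max_stars: float = 10.0) -> float:
    """Convert a star rating to a 0-100 score (assumes a linear scale)."""
    if not 0 <= stars <= max_stars:
        raise ValueError(f"star rating out of range: {stars}")
    return stars / max_stars * 100


def normalize(record: dict) -> dict:
    """Return a record with a single 'score' field on a 0-100 scale."""
    if "stars" in record:
        score = stars_to_percent(record["stars"])
    elif "percent" in record:
        score = float(record["percent"])
    else:
        raise KeyError("no rating field found")
    return {"title": record["title"], "score": score}


# Hypothetical records from two sources describing the same movie.
ratings = [
    {"title": "Example Movie", "stars": 7.5},   # star-based source
    {"title": "Example Movie", "percent": 80},  # percentage-based source
]

normalized = [normalize(r) for r in ratings]
for row in normalized:
    print(row)
```

Once both records share the same "score" field they can be compared, averaged, or searched together. The same idea applies to gene bank data, where one bank might record a collection date or a location in a different form than another.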

Possible solutions

If all gene banks used the same seed management system, instead of something they built on their own, their data would be stored using one standard schema. Gathering and searching this data would then be simple, and the data would be much more accessible.
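As a rough illustration of why a shared schema makes searching simple, the Python sketch below defines one record type and searches across two banks at once. The `Accession` type, its fields, and the sample records are all hypothetical; they are loosely based on the kinds of data mentioned earlier (collection location and date) and are not taken from any real gene bank system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Accession:
    """One stored seed sample, described with a shared (hypothetical) schema."""
    bank: str
    accession_id: str
    species: str
    collected_at: str
    collected_on: date


def search_by_species(records: Iterable[Accession], species: str) -> List[Accession]:
    """Case-insensitive exact match on the species name."""
    wanted = species.strip().lower()
    return [r for r in records if r.species.lower() == wanted]


# Made-up sample data from two banks that share the schema.
bank_a = [
    Accession("Bank A", "A-001", "Oryza sativa", "Field site 1", date(2001, 5, 3)),
]
bank_b = [
    Accession("Bank B", "B-417", "Oryza sativa", "Field site 2", date(1995, 9, 21)),
    Accession("Bank B", "B-418", "Zea mays", "Field site 3", date(1998, 7, 2)),
]

# Because both banks use the same fields, one query covers both.
matches = search_by_species(bank_a + bank_b, "oryza sativa")
for m in matches:
    print(m.bank, m.accession_id, m.collected_on.isoformat())
```

No per-bank conversion code is needed here, which is exactly what a standard schema buys you.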

This would be the ideal scenario, but unfortunately forcing everyone to use a specific system will not work. There's always going to be someone more comfortable with building their own solution. This is how business has always worked. A single system cannot accommodate everyone's needs.

Another interesting approach would be to adopt a standard schema and have the gene banks publish their data online using microformats. Microformats are already being adopted for other kinds of information across the web, so it makes sense to use them for the data held inside gene banks as well.
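To give a feel for how this could work, the Python sketch below reads records out of a web page whose HTML elements carry agreed-upon class names. The class names used here ("h-accession", "p-species", "p-location", "p-collected") are invented for this example and are not an existing microformat vocabulary; the sample HTML is made up as well. Only the standard library's html.parser module is used.

```python
from html.parser import HTMLParser


class AccessionParser(HTMLParser):
    """Collect fields from elements marked with hypothetical class names."""

    def __init__(self):
        super().__init__()
        self.records = []      # one dict per "h-accession" element
        self._current = None   # record currently being filled
        self._field = None     # field name awaiting its text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "h-accession" in classes:
            # Start a new record; it stays open until the next one begins.
            self._current = {}
            self.records.append(self._current)
        elif self._current is not None:
            for name in classes:
                if name.startswith("p-"):
                    self._field = name[2:]

    def handle_data(self, data):
        # Store the first text found after a "p-" element opens.
        if self._current is not None and self._field:
            self._current[self._field] = data.strip()
            self._field = None


# Made-up page fragment a gene bank might publish.
page = """
<div class="h-accession">
  <span class="p-species">Triticum aestivum</span>
  <span class="p-location">Example collection site</span>
  <time class="p-collected">1998-06-14</time>
</div>
"""

parser = AccessionParser()
parser.feed(page)
print(parser.records)
```

This sketch does not track closing tags or nested elements, so a real harvester would need a proper microformats parser. The point is only that the data stays in the gene bank's own web pages and anyone who knows the class names can collect it.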

Luca Matteis / Sunday, March 25, 2012
