
7 Computational implementation


Table of contents
  1. 7.1 Software development methodologies: Rapid application development and Test-driven development
  2. 7.2 Programming languages: Object-oriented programming and Python
  3. 7.3 Statistical programming languages: R and other statistical computing languages
  4. 7.4 Data storage: Text files and MySQL
  5. References

This chapter gives an overview of the computational implementation approaches used in our research. We mainly used Python for general programming, and R for statistical programming and analysis. The chapter also briefly describes the software development methodologies and data storage approaches we adopted.

7.1 Software development methodologies: Rapid application development and Test-driven development

Many conventional software development methodologies, such as variants of the waterfall method [1], require strict documentation in the design phase and full implementation rather than prototyping in the implementation phase. A waterfall method is a sequential design process that flows steadily downwards through the phases of the software development life cycle. These strict, inflexible phases led to many failed system development projects in the 1980s and early 1990s, especially in large-scale projects. Rapid application development (RAD) [2] is a relatively new software development methodology that aims to decrease the complexity of implementation and increase the speed of application development. There are many types of RAD, such as Scrum [3], Agile software development (Agile) [4], and Extreme Programming (XP) [5]. Most of them focus on simplifying each phase and shortening the software development cycle. One of the most suitable RAD approaches for small or mid-size bioinformatics projects is Test-driven development (TDD) [6], which is equivalent to the test-first programming concept of XP. TDD enforces the creation of unit tests before the actual coding. Although full compliance with TDD is not necessary, writing as many unit tests as possible with mock data can ensure high reliability of the application.
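
Although TDD itself is language-agnostic, the following minimal sketch using Python's built-in unittest framework illustrates the test-first style: the unit tests are written first, against mock sequences, and the function is implemented afterwards to make them pass. The function reverse_complement and the mock data are hypothetical and are not taken from our project.

```python
import unittest

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (implemented after the tests)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))

class TestReverseComplement(unittest.TestCase):
    """Unit tests written first, against mock data, before the implementation."""

    def test_simple_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_lower_case_input(self):
        self.assertEqual(reverse_complement("atgc"), "GCAT")

if __name__ == "__main__":
    unittest.main()
```

Running the module directly executes the tests; in a strict TDD workflow they would initially fail until the function is implemented.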

7.2 Programming languages: Object-oriented programming and Python

The most popular programming paradigm today is object-oriented programming (OOP). OOP uses “objects” defined by corresponding “classes”. Objects are actual data together with their data structures and procedures, whereas classes are the definitions or templates used to create objects. Many programming languages currently support full or partial OOP; some of the popular ones are C++, Java, Perl, and Python. All of these programming languages are freely available (Table 7.1), and they are widely used in bioinformatics analysis. Table 7.1 also shows two additional languages, Haskell and Go, as examples of the functional and concurrent programming paradigms, though these paradigms are less common than OOP in general.
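
To make the class/object distinction concrete, the short Python sketch below defines a class as a template and then creates two objects from it; the class name GeneRecord and its attributes are hypothetical and chosen only for illustration.

```python
class GeneRecord:
    """A class: a template defining the data structure and procedures of its objects."""

    def __init__(self, name, sequence):
        self.name = name          # data held by each object
        self.sequence = sequence

    def length(self):
        """A procedure (method) shared by all objects of this class."""
        return len(self.sequence)

# Objects: actual data created from the class template.
gene_a = GeneRecord("geneA", "ATGGCGTAA")
gene_b = GeneRecord("geneB", "ATGTTTAAACCCGGGTAA")

print(gene_a.length(), gene_b.length())  # 9 18
```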


Table 7.1. Programming languages.

The table shows a list of freely available programming languages. “Program”, “Type”, and “URL for software environment” represent the name of the programming language, the programming paradigm, and the URL for downloading the software environment, respectively. OOP, Func, and Conc in the “Type” column represent three programming paradigms: object-oriented, functional, and concurrent programming, respectively.

Program        Type   URL for software environment
C++ (GNU)      OOP    https://gcc.gnu.org
Java (Oracle)  OOP    https://www.oracle.com/technetwork/java
Perl           OOP    https://www.perl.org
Python         OOP    https://www.python.org
Haskell        Func   https://www.haskell.org
Go             Conc   https://golang.org

Among them, Python has emerged as one of the most popular languages in bioinformatics. Python requires no static type checking, which enhances productivity in a RAD setting. It also supports multiple programming paradigms, so, for instance, object-oriented and functional programming can be mixed within the same module. Moreover, Python offers two very powerful libraries for scientific computing: NumPy (https://numpy.org/) and SciPy (https://www.scipy.org/). Biopython (https://biopython.org) provides a set of libraries for biological computation in Python. PyML, a machine learning package for Python, provides useful functions for building and testing an SVM model [7]. We mainly used PyML to build our two-step SVM model for miRNA target prediction.
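
The snippet below is a minimal sketch of how these libraries can be combined: Biopython's Bio.SeqIO parses a FASTA file and NumPy summarizes the GC content of the parsed sequences. The file name example.fasta is a placeholder, and the snippet illustrates the libraries themselves rather than the actual code behind our SVM model.

```python
import numpy as np
from Bio import SeqIO

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = str(seq).upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Parse a FASTA file (placeholder name) and compute per-sequence GC content.
gc_values = np.array([gc_content(record.seq)
                      for record in SeqIO.parse("example.fasta", "fasta")])

print("sequences:", gc_values.size)
print("mean GC:", gc_values.mean(), "std:", gc_values.std())
```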

7.3 Statistical programming languages: R and other statistical computing languages

Some programming languages are specialized for statistical computing and graphics. For instance, SAS, SPSS, STATA, and R are software environments designed primarily for statistical computing, whereas MATLAB and Mathematica are general computing languages that also provide statistical features. Among them, only R is open-source software and freely available (Table 7.2). Moreover, R provides many additional libraries, as well as Bioconductor (https://www.bioconductor.org), a comprehensive framework for high-throughput genome data analysis; hence, it is the most popular statistical computing language in bioinformatics today.


Table 7.2. Programming languages for statistical analysis.

The table shows a list of programming languages that can be used for statistical analysis. “Program”, “License”, and “URL” represent the name of the programming language, the type of license, and the URL of the organization or institute that provides the software, respectively.

Program      License      URL
SAS          proprietary  https://www.sas.com
SPSS         proprietary  https://www.ibm.com/analytics/spss-statistics-software
STATA        proprietary  https://www.stata.com
R            open source  https://www.r-project.org
MATLAB       proprietary  https://www.mathworks.com/products/matlab
Mathematica  proprietary  https://www.wolfram.com/mathematica

7.4 Data storage: Text files and MySQL

Handling the large data sets produced by high-throughput technologies usually requires a method for effective data retrieval and manipulation. The easiest approach is to use text files with a user-defined or pre-defined format. Popular pre-defined formats in bioinformatics include FASTA for nucleotide and peptide sequences, GFF (general feature format) for positional genomic features, and MAF (multiple alignment format) for multiple alignments.
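
As a simple illustration of working with such pre-defined text formats, the sketch below reads a tab-delimited GFF file using plain Python; the file name example.gff and the fields printed are placeholders chosen for illustration.

```python
# Minimal GFF reader: each non-comment line has nine tab-separated columns:
# seqname, source, feature, start, end, score, strand, frame, attributes.
def read_gff(path):
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            fields = line.rstrip("\n").split("\t")
            seqname, source, feature, start, end, score, strand, frame, attributes = fields
            yield {
                "seqname": seqname,
                "feature": feature,
                "start": int(start),
                "end": int(end),
                "strand": strand,
                "attributes": attributes,
            }

for record in read_gff("example.gff"):  # placeholder file name
    print(record["seqname"], record["feature"], record["start"], record["end"])
```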

A relational database (RDB) [8] management system offers more elaborate data storage mechanisms than simple text files. In an RDB, all data are stored in multiple tables and are usually accessed through the structured query language (SQL). MySQL is the most popular freely available RDB in bioinformatics projects, but other free RDBs, such as PostgreSQL (Postgres) and SQLite, are also widely used (Table 7.3).
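
The sketch below shows the basic pattern of storing and retrieving data through SQL from Python. It uses the built-in sqlite3 module as a lightweight stand-in for MySQL (which would normally be accessed through a driver such as MySQLdb or mysql-connector-python); the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT, chrom TEXT)")
cur.executemany("INSERT INTO genes (name, chrom) VALUES (?, ?)",
                [("geneA", "chr1"), ("geneB", "chr2")])
conn.commit()

# Data are retrieved through SQL queries.
for row in cur.execute("SELECT name, chrom FROM genes WHERE chrom = ?", ("chr1",)):
    print(row)

conn.close()
```

With a MySQL server, only the connection setup and the parameter placeholder style would differ; the overall SQL pattern stays the same.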


Table 7.3. Relational and NoSQL databases.

The table shows a list of relational and NoSQL databases. (1) PostgreSQL Global Development Group. (2) SQLite Consortium. (3) Google App Engine provides access to BigTable.

Name        Type   Provider          URL
MySQL       RDB    Oracle            https://www.mysql.com
PostgreSQL  RDB    PGDG (1)          https://www.postgresql.org
SQLite      RDB    SQLite Cons. (2)  https://www.sqlite.org
MongoDB     NoSQL  10gen             https://www.mongodb.org
BigTable    NoSQL  Google            https://cloud.google.com/appengine (3)
SimpleDB    NoSQL  Amazon            https://aws.amazon.com/simpledb

In an RDB, data in multiple tables are “joined” when they are retrieved together. RDB management systems generally perform poorly when joining tables at the tera- or petabyte scale. NoSQL database management systems have therefore emerged to handle data even at the petabyte scale; they typically avoid SQL and relational tables. Many NoSQL systems are available today, including MongoDB, Google BigTable [9], and Amazon SimpleDB (Table 7.3).
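
For comparison, the sketch below stores a comparable record in MongoDB through the pymongo driver: related data are kept together in a single document instead of being spread over joined tables. The connection string, database, and collection names are placeholders, and a locally running MongoDB server is assumed.

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally (placeholder connection string).
client = MongoClient("mongodb://localhost:27017/")
collection = client["bioinfo"]["genes"]  # placeholder database and collection names

# A document keeps related data together instead of spreading it over joined tables.
collection.insert_one({
    "name": "geneA",
    "chrom": "chr1",
    "transcripts": [
        {"id": "geneA.1", "exons": 4},
        {"id": "geneA.2", "exons": 3},
    ],
})

print(collection.find_one({"name": "geneA"}))
client.close()
```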

Even though handling terabyte-scale data is becoming more important as sequence data from next-generation sequencing accumulate, many programming languages still lack easy-to-use libraries for accessing NoSQL management systems. Therefore, using text files together with RDBs, rather than NoSQL, is still the major practice in bioinformatics.

References

  1. Royce WW. Managing the development of large software systems. Technical Papers of Western Electronic Show and Convention, 1970, p. 1–9.
  2. Martin J. Rapid application development. Pearson Higher Education; 1991.
  3. Rising L, Janoff NS. The Scrum software development process for small teams. IEEE Software 2000;17:26–32. https://doi.org/10.1109/52.854065.
  4. Cockburn A, Highsmith J. Agile software development, the people factor. Computer 2001;34:131–3. https://doi.org/10.1109/2.963450.
  5. Auer K, Miller R. Extreme programming applied: playing to win. Addison-Wesley Professional; 2001.
  6. Beck K. Test driven development: by example. Addison-Wesley Professional; 2002.
  7. Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G. Support vector machines and kernels for computational biology. PLoS Computational Biology 2008;4:e1000173. https://doi.org/10.1371/journal.pcbi.1000173.
  8. Codd EF. A relational model of data for large shared data banks. Communications of the ACM 1970;13:377–87. https://doi.org/10.1145/362384.362685.
  9. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 2008;26:1–26. https://doi.org/10.1145/1365815.1365816.
