
7 Computational implementation


Table of contents
  1. 7.1 Software development methodologies: Rapid application development and Test-driven development
  2. 7.2 Programming languages: Object-oriented programming and Python
  3. 7.3 Statistical programming languages: R and other statistical computing languages
  4. 7.4 Data storage: Text files and MySQL
  5. References

This chapter gives an overview of the computational implementation approaches used in our research. We mainly used Python for general programming, and R for statistical programming and analysis. The chapter also briefly describes the software development methodologies and data storage approaches we adopted.

7.1 Software development methodologies: Rapid application development and Test-driven development

Many conventional software development methodologies, such as variants of the waterfall method [1], require strict documentation in the design phase and full implementation rather than prototyping in the implementation phase. A waterfall method is a sequential design process that flows steadily downwards through the phases of the software development life cycle. These strict, inflexible phases led to many failed system development projects in the 1980s and early 1990s, especially in large-scale projects. Rapid application development (RAD) [2] is a relatively new software development methodology that aims to decrease the complexity of implementation and increase the speed of application development. There are many types of RAD, such as Scrum [3], Agile software development (Agile) [4], and Extreme Programming (XP) [5]. Most of them focus on simplifying each phase and shortening the software development cycle. One of the most suitable RAD approaches for small or mid-size bioinformatics projects is Test-driven development (TDD) [6], which is equivalent to the test-first programming concept of XP. TDD enforces the creation of unit tests before the actual coding. Although full compliance with TDD is not necessary, writing as many unit tests as possible with mock data can ensure high reliability of the application.
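
Although TDD itself is language-agnostic, the following minimal sketch using Python's built-in unittest framework illustrates the test-first style: the unit tests are written first, against mock sequences, and the function is implemented afterwards to make them pass. The function reverse_complement and the mock data are hypothetical and are not taken from our project.

```python
import unittest

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (implemented after the tests)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))

class TestReverseComplement(unittest.TestCase):
    """Unit tests written first, against mock data, before the implementation."""

    def test_simple_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_lower_case_input(self):
        self.assertEqual(reverse_complement("atgc"), "GCAT")

if __name__ == "__main__":
    unittest.main()
```

Running the module directly executes the tests; in a strict TDD workflow they would initially fail until the function is implemented.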

7.2 Programming languages: Object-oriented programming and Python

The most popular programming paradigm today is object-oriented programming (OOP). OOP uses “objects” defined by corresponding “classes”. Objects are actual data together with their data structures and procedures, whereas classes are the definitions or templates used to create objects. Many programming languages currently support full or partial OOP; some of the popular ones are C++, Java, Perl, and Python. All of these programming languages are freely available (Table 7.1), and they are widely used in bioinformatics analysis. Table 7.1 also shows two additional languages, Haskell and Go, as examples of the functional and concurrent programming paradigms, though these paradigms are less common than OOP in general.
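
To make the class/object distinction concrete, the short Python sketch below defines a class as a template and then creates two objects from it; the class name GeneRecord and its attributes are hypothetical and chosen only for illustration.

```python
class GeneRecord:
    """A class: a template defining the data structure and procedures of its objects."""

    def __init__(self, name, sequence):
        self.name = name          # data held by each object
        self.sequence = sequence

    def length(self):
        """A procedure (method) shared by all objects of this class."""
        return len(self.sequence)

# Objects: actual data created from the class template.
gene_a = GeneRecord("geneA", "ATGGCGTAA")
gene_b = GeneRecord("geneB", "ATGTTTAAACCCGGGTAA")

print(gene_a.length(), gene_b.length())  # 9 18
```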


Table 7.1. Programming languages.

The table shows a list of freely available programming languages. “Program”, “Type”, and “URL for software environment” represent the name of the programming language, the programming paradigm, and the URL for downloading the software environment, respectively. OOP, Func, and Conc in the “Type” column represent three programming paradigms: object-oriented, functional, and concurrent programming, respectively.

Program        Type   URL for software environment
C++ (GNU)      OOP    https://gcc.gnu.org
Java (Oracle)  OOP    https://www.oracle.com/technetwork/java
Perl           OOP    https://www.perl.org
Python         OOP    https://www.python.org
Haskell        Func   https://www.haskell.org
Go             Conc   https://golang.org

Among them, Python has emerged as one of the most popular languages in bioinformatics. Python requires no static type checking, which enhances productivity in a RAD setting. It also supports multiple programming paradigms, so, for instance, object-oriented and functional programming can be mixed within the same module. Moreover, Python offers two very powerful libraries for scientific computing: NumPy (https://numpy.org/) and SciPy (https://www.scipy.org/). Biopython (https://biopython.org) provides a set of libraries for biological computation in Python. PyML, a machine learning package for Python, provides useful functions for building and testing an SVM model [7]. We mainly used PyML to build our two-step SVM model for miRNA target prediction.
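
The snippet below is a minimal sketch of how these libraries can be combined: Biopython's Bio.SeqIO parses a FASTA file and NumPy summarizes the GC content of the parsed sequences. The file name example.fasta is a placeholder, and the snippet illustrates the libraries themselves rather than the actual code behind our SVM model.

```python
import numpy as np
from Bio import SeqIO

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = str(seq).upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Parse a FASTA file (placeholder name) and compute per-sequence GC content.
gc_values = np.array([gc_content(record.seq)
                      for record in SeqIO.parse("example.fasta", "fasta")])

print("sequences:", gc_values.size)
print("mean GC:", gc_values.mean(), "std:", gc_values.std())
```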

7.3 Statistical programming languages: R and other statistical computing languages

Some programming languages are specialized for statistical computing and graphics. For instance, SAS, SPSS, STATA, and R are software environments designed primarily for statistical computing, whereas MATLAB and Mathematica are general computing languages that also provide statistical features. Among them, only R is open-source software and freely available (Table 7.2). Moreover, R provides many additional libraries, as well as Bioconductor (https://www.bioconductor.org), a comprehensive framework for high-throughput genome data analysis; hence, it is the most popular statistical computing language in bioinformatics today.


Table 7.2. Programming languages for statistical analysis.

The table shows a list of programming languages that can be used for statistical analysis. “Program”, “License”, and “URL” represent the name of the programming language, the type of license, and the URL of the organization or institute that provides the software, respectively.

Program      License      URL
SAS          proprietary  https://www.sas.com
SPSS         proprietary  https://www.ibm.com/analytics/spss-statistics-software
STATA        proprietary  https://www.stata.com
R            open source  https://www.r-project.org
MATLAB       proprietary  https://www.mathworks.com/products/matlab
Mathematica  proprietary  https://www.wolfram.com/mathematica

7.4 Data storage: Text files and MySQL

Handling the large data sets produced by high-throughput technologies usually requires a method for effective data retrieval and manipulation. The easiest approach is to use text files with a user-defined or pre-defined format. Popular pre-defined formats in bioinformatics include FASTA for nucleotide and peptide sequences, GFF (general feature format) for positional genomic features, and MAF (multiple alignment format) for multiple alignments.
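
As a simple illustration of working with such pre-defined text formats, the sketch below reads a tab-delimited GFF file using plain Python; the file name example.gff and the fields printed are placeholders chosen for illustration.

```python
# Minimal GFF reader: each non-comment line has nine tab-separated columns:
# seqname, source, feature, start, end, score, strand, frame, attributes.
def read_gff(path):
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            fields = line.rstrip("\n").split("\t")
            seqname, source, feature, start, end, score, strand, frame, attributes = fields
            yield {
                "seqname": seqname,
                "feature": feature,
                "start": int(start),
                "end": int(end),
                "strand": strand,
                "attributes": attributes,
            }

for record in read_gff("example.gff"):  # placeholder file name
    print(record["seqname"], record["feature"], record["start"], record["end"])
```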

A relational database (RDB) [8] management system offers more elaborate data storage mechanisms than simple text files. In an RDB, all data are stored in multiple tables and are usually accessed through the structured query language (SQL). MySQL is the most popular freely available RDB in bioinformatics projects, but other free RDBs, such as PostgreSQL (Postgres) and SQLite, are also widely used (Table 7.3).
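
The sketch below shows the basic pattern of storing and retrieving data through SQL from Python. It uses the built-in sqlite3 module as a lightweight stand-in for MySQL (which would normally be accessed through a driver such as MySQLdb or mysql-connector-python); the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT, chrom TEXT)")
cur.executemany("INSERT INTO genes (name, chrom) VALUES (?, ?)",
                [("geneA", "chr1"), ("geneB", "chr2")])
conn.commit()

# Data are retrieved through SQL queries.
for row in cur.execute("SELECT name, chrom FROM genes WHERE chrom = ?", ("chr1",)):
    print(row)

conn.close()
```

With a MySQL server, only the connection setup and the parameter placeholder style would differ; the overall SQL pattern stays the same.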


Table 7.3. Relational and NoSQL databases.

The table shows a list of relational and NoSQL databases. (1) PostgreSQL Global Development Group. (2) SQLite Consortium. (3) Google App Engine provides access to BigTable.

Name        Type   Provider          URL
MySQL       RDB    Oracle            https://www.mysql.com
PostgreSQL  RDB    PGDG (1)          https://www.postgresql.org
SQLite      RDB    SQLite Cons. (2)  https://www.sqlite.org
MongoDB     NoSQL  10gen             https://www.mongodb.org
BigTable    NoSQL  Google            https://cloud.google.com/appengine (3)
SimpleDB    NoSQL  Amazon            https://aws.amazon.com/simpledb

In an RDB, data in multiple tables are “joined” when they are retrieved together. RDB management systems generally perform poorly when joining tables at the tera- or petabyte scale. NoSQL database management systems have therefore emerged to handle data even at the petabyte scale; they typically avoid SQL and relational tables. Many NoSQL systems are available today, including MongoDB, Google BigTable [9], and Amazon SimpleDB (Table 7.3).
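
For comparison, the sketch below stores a comparable record in MongoDB through the pymongo driver: related data are kept together in a single document instead of being spread over joined tables. The connection string, database, and collection names are placeholders, and a locally running MongoDB server is assumed.

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally (placeholder connection string).
client = MongoClient("mongodb://localhost:27017/")
collection = client["bioinfo"]["genes"]  # placeholder database and collection names

# A document keeps related data together instead of spreading it over joined tables.
collection.insert_one({
    "name": "geneA",
    "chrom": "chr1",
    "transcripts": [
        {"id": "geneA.1", "exons": 4},
        {"id": "geneA.2", "exons": 3},
    ],
})

print(collection.find_one({"name": "geneA"}))
client.close()
```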

Even though handling terabyte-scale data is becoming more important as sequence data from next-generation sequencing accumulate, many programming languages still lack easy-to-use libraries for accessing NoSQL management systems. Therefore, using text files together with RDBs, rather than NoSQL, is still the major practice in bioinformatics.

References

  1. Royce WW. Managing the development of large software systems. Technical Papers of Western Electronic Show and Convention, 1970, p. 1–9.
  2. Martin J. Rapid application development. Pearson Higher Education; 1991.
  3. Rising L, Janoff NS. The Scrum software development process for small teams. IEEE Software 2000;17:26–32. https://doi.org/10.1109/52.854065.
  4. Cockburn A, Highsmith J. Agile software development, the people factor. Computer 2001;34:131–3. https://doi.org/10.1109/2.963450.
  5. Auer K, Miller R. Extreme programming applied: playing to win. Addison-Wesley Professional; 2001.
  6. Beck K. Test driven development: by example. Addison-Wesley Professional; 2002.
  7. Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G. Support vector machines and kernels for computational biology. PLoS Computational Biology 2008;4:e1000173. https://doi.org/10.1371/journal.pcbi.1000173.
  8. Codd EF. A relational model of data for large shared data banks. Communications of the ACM 1970;13:377–87. https://doi.org/10.1145/362384.362685.
  9. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 2008;26:1–26. https://doi.org/10.1145/1365815.1365816.
