Table of Contents:
- Aims of bioinformatics
- Biological databases
- Importance of biological database
- Primary Databases
- Secondary Databases
- Primary database types
- Secondary database types
- Some Examples of Primary Databases
- 1. GenBank
- 2. EMBL (European Molecular Biology Laboratory)
- 3. Swiss-Prot
- 4. Protein Information Resource (PIR)
- Some Examples of Secondary Databases:
- 1. Motif Databases
- 2. Domain Databases
- 3. 3D structure databases
- 4. Gene Expression Databases
- 5. Metabolic pathway databases
- 6. Genome Databases
- 7. Virological databases
- 8. World biodiversity databases
Aims of bioinformatics
- Development and use of databases containing all biological information.
- Development and use of better tools for data design, annotation and mining.
- Design and development of drugs by using simulation software.
- Design and development of software tools for protein structure prediction, function annotation and docking analysis.
- Development of improved software tools for analysing sequences for their function and their similarity to other sequences.
- Analyze vast data to understand cellular functions, growth, development, and disease.
- Use sequence or structural data to predict the role of new genes and proteins.
- Analyze protein structures to identify drug targets and simulate drug interactions.
- Analyze genomes to understand relationships, conserved genes, and life's history.
- Develop methods for storing and retrieving massive biological datasets.
- Analyze individual genetic makeup to predict disease risk and tailor treatments.
Biological databases
- Biological data are complex, vast, and incomplete. Therefore, several databases have been created and interpreted to ensure unambiguous results.
- A biological database is a collection of biological data arranged in computer-readable form that speeds up search and retrieval and is convenient to use.
- A good database must have updated information.
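To illustrate why a computer-readable arrangement speeds up search and retrieval, here is a minimal Python sketch. The records, identifiers, and field names are invented for illustration and do not come from any real database; the point is only that an index lets a lookup go straight to the matching records instead of scanning every entry.

```python
# Hypothetical records: identifiers and fields are made up for illustration.
records = [
    {"id": "SEQ001", "organism": "Escherichia coli", "sequence": "ATGGCGTAA"},
    {"id": "SEQ002", "organism": "Homo sapiens", "sequence": "ATGCCCTGA"},
    {"id": "SEQ003", "organism": "Escherichia coli", "sequence": "ATGAAATAG"},
]

def build_index(records, field):
    """Map each value of `field` to the list of record ids that carry it."""
    index = {}
    for record in records:
        index.setdefault(record[field], []).append(record["id"])
    return index

organism_index = build_index(records, "organism")
print(organism_index["Escherichia coli"])  # ['SEQ001', 'SEQ003']
```

Real biological databases use far more elaborate indexing, but the principle of trading a one-time indexing cost for fast lookups is the same.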
Importance of biological database
Biological databases give researchers organized access to information on:
- biological sequences
- structures, binding sites
- metabolic interactions
- molecular action
- functional relationships
- protein families
- motifs.
Primary Databases:
- Origin of Data: Store original, experimental data
directly submitted by researchers. This data could be:
- DNA or protein sequences
- Macromolecular structures
- Microarray data
- Genotype data
- Examples:
- GenBank (DNA sequences)
- UniProtKB (protein sequences and
annotations)
- PDB (Protein Data Bank) (3D structures of
proteins and nucleic acids)
- GEO (Gene Expression Omnibus) (microarray
data)
- dbSNP (Single Nucleotide Polymorphism
database) (genotype data)
- Purpose: Act as an archive for raw experimental data, ensuring its
accessibility and reproducibility for future research.
- Data Format: Structured and standardized format to
facilitate easy search and retrieval.
- Data Annotation: Minimal annotations, focusing on data
description and provenance (origin).
Secondary Databases:
- Origin of Data: Derive data from primary databases and
integrate it with information from other sources. They perform various
analyses and interpretations on the raw data.
- Examples:
- KEGG (Kyoto Encyclopedia of Genes and
Genomes) (pathway maps and biological networks)
- SWISS-PROT (protein function, structure,
and sequence information)
- Ensembl (annotated genomes)
- ViruSITE (viral genomics database)
- Purpose: Provide a higher level of analysis and interpretation of the data,
offering insights into biological functions, relationships, and processes.
- Data Format: Structured and integrated with additional
annotations and functionalities for searching, browsing, and analysis.
- Data Annotation: Rich annotations, including functional
information, protein-protein interactions, and links to relevant
publications.
Here's an analogy to illustrate the difference:
- Primary Database: Like a library's archive, where original
research papers are stored.
- Secondary Database: Like a literature review article or
textbook, summarizing and analyzing information from various primary
sources.
Choosing the Right Database:
- Use Primary Databases: When you need access to the raw,
uninterpreted experimental data for further analysis or verification.
- Use Secondary Databases: When you need a more comprehensive view
of biological information, with interpretations, annotations, and
functionalities for exploring relationships and functions.
Primary database types
- Sequence Databases: These databases store nucleotide and protein sequences obtained from various organisms. Examples include the National Center for Biotechnology Information (NCBI) GenBank, the European Nucleotide Archive (ENA, formerly the EMBL Nucleotide Sequence Database), and UniProt for protein sequences.
- Structure Databases: These databases store three-dimensional structures of biological macromolecules, such as proteins, nucleic acids, and complexes. Examples include the Protein Data Bank (PDB) and the Electron Microscopy Data Bank (EMDB).
- Genomic Databases: These databases contain genomic information, including assembled genomes, annotations, and genetic variation data. Examples include the Ensembl Genome Browser, UCSC Genome Browser, and the Single Nucleotide Polymorphism database (dbSNP).
- Expression Databases: These databases store data on gene expression levels across different conditions, tissues, and developmental stages. Examples include the Gene Expression Omnibus (GEO), ArrayExpress, and The Cancer Genome Atlas (TCGA).
- Protein Interaction Databases: These databases contain information on protein-protein interactions, protein complexes, and signaling pathways. Examples include the Biological General Repository for Interaction Datasets (BioGRID), STRING, and IntAct.
- Metabolic Pathway Databases: These databases store information on metabolic pathways, including enzyme reactions, metabolites, and pathway maps. Examples include KEGG, Reactome, and MetaCyc.
- Ontology Databases: These databases store controlled vocabularies and hierarchical relationships to standardize the annotation of biological data. Examples include the Gene Ontology (GO) database and the Human Disease Ontology (DO).
- Literature Databases: These databases index scientific literature and provide tools for searching and accessing research articles relevant to specific topics. Examples include PubMed, PubMed Central (PMC), and Google Scholar.
- Clinical Databases: These databases store clinical and healthcare data, including patient records, disease registries, and clinical trial information. Examples include ClinVar for clinical variants, ClinicalTrials.gov for clinical trials, and OMIM for human genetics and phenotypes.
- Microarray and Next-Generation Sequencing (NGS) Databases: These databases store data generated from high-throughput technologies, such as microarray experiments and NGS studies. Examples include GEO for microarray data and the Sequence Read Archive (SRA) for NGS data.
Secondary database types
- InterPro: Classifies protein sequences into families and predicts the presence of domains and important sites within them.
- Swiss-Prot and TrEMBL: Annotated protein sequence databases that provide comprehensive information on protein function, structure, and interactions.
- RefSeq: Curates and annotates a comprehensive collection of reference sequences for genomes, transcripts, and proteins from various organisms.
- OMIM (Online Mendelian Inheritance in Man): Curates information on human genes and genetic disorders, including phenotypic descriptions, inheritance patterns, and molecular mechanisms.
- UCSC Genome Browser: Provides a comprehensive collection of genome sequences and annotations for various organisms, along with tools for visualizing and analyzing genomic data.
- ENSEMBL: Integrates genomic, transcriptomic, and proteomic data for various organisms, providing genome browsing, gene annotation, and comparative genomics tools.
- Reactome: Curates and annotates human biological pathways, providing detailed information on molecular interactions, signaling cascades, and metabolic pathways.
- dbGaP (database of Genotypes and Phenotypes): Archives and distributes genotype and phenotype data from human genetic studies, facilitating research on the genetic basis of complex traits and diseases.
- Pfam: Curates a comprehensive collection of protein families and domains, providing annotations, alignments, and hidden Markov models for protein sequence analysis.
- KEGG (Kyoto Encyclopedia of Genes and Genomes): Integrates genomic, chemical, and functional information for various organisms, including pathways, diseases, and drugs, to facilitate systems biology and drug discovery research.
Some Examples of Primary Databases:
1. GenBank:
- One of the fastest growing repositories of known nucleotide sequences, GenBank (Genetic Sequence Databank) has a flat file structure. Each record is an ASCII text file, readable by both humans and computers. Besides sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification and references to published literature.
- While GenBank's data was originally stored in a flat file format, it has since transitioned to a more complex structure using XML and ASN.1 formats for improved manageability and data exchange.
- This database has been developed and maintained at the NCBI, Bethesda, MD, USA, as a part of the International Nucleotide Sequence Database Collaboration (INSDC).
- It is an open access sequence database.
- It coordinates with individual laboratories and other sequence databases like EMBL and DDBJ.
- It is an annotated collection of all nucleotide sequences that are available to the public.
- The nucleotide database was divided into three databases at NCBI: CoreNucleotide database, Expressed Sequence Tag (EST) and Genome Survey Sequence (GSS).
- The CoreNucleotide database holds most of the nucleotide sequences in use. It includes all nucleotide records that are not in the EST and GSS databases.
- Sequences can be submitted to GenBank using the BankIt, Sequin and tbl2asn tools.
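Because a GenBank flat file is plain text with keywords at the start of each line, it can be read with very little code. The following is a minimal sketch, not a full parser: the record is a heavily abridged, made-up example (the locus name and sequence are hypothetical, and the accession reuses the illustrative form U12345), and only a few header fields plus the sequence are extracted. It assumes each header value fits on one line.

```python
# Abridged, hypothetical record in GenBank flat-file layout (not a real entry).
DEMO_RECORD = """\
LOCUS       DEMO0001                  12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demonstration record.
ACCESSION   U12345
VERSION     U12345.1
ORIGIN
        1 atggcgtaat ga
//
"""

HEADER_KEYWORDS = {"LOCUS", "DEFINITION", "ACCESSION", "VERSION"}

def parse_genbank_record(text):
    """Extract a few header fields and the sequence from one flat-file record."""
    fields = {}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):           # record terminator
            break
        if in_origin:
            # Sequence lines hold a position number followed by blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):     # sequence section begins
            in_origin = True
        elif line[:12].strip() in HEADER_KEYWORDS:
            # The keyword occupies the first 12 columns; the value follows.
            fields[line[:12].strip()] = line[12:].strip()
    fields["sequence"] = "".join(sequence_parts).upper()
    return fields

record = parse_genbank_record(DEMO_RECORD)
print(record["ACCESSION"], record["sequence"])
```

For real work, an established library is a better choice than hand-written parsing, since actual records contain multi-line fields and a feature table that this sketch ignores.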
2. EMBL (European Molecular Biology Laboratory):
- A comprehensive database of DNA and RNA sequences, the EMBL nucleotide sequence database is compiled from the scientific literature, patent offices, and direct submissions by researchers. It is maintained in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).
- It was established in 1980.
- It is maintained by the EBI (European Bioinformatics Institute).
3. Swiss-Prot:
- This is a curated protein sequence database that offers a high level of integration with other databases and has a very low level of redundancy.
- Swiss-Prot strives to provide protein sequences with a high level of annotation (for instance, the description of protein function, domain structure, and post-translational modifications).
- It was established in 1986 and has been maintained collaboratively since 1987 by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library.
- TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all translations of EMBL nucleotide sequence entries that are not yet integrated into Swiss-Prot.
- Currently Swiss-Prot has 0.5 million and TrEMBL has 7.6 million sequences.
4. Protein Information Resource (PIR):
- PIR is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies. PIR now offers a wide variety of resources aimed mainly at assisting the propagation and consistency of protein annotations, such as PIRSF, iProClass and iProLINK.
- The current successor to PIR is UniProtKB, which combines information from Swiss-Prot (highly curated protein sequences), TrEMBL (computer-annotated sequences), and PIR's legacy data.
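TrEMBL is described above as a set of translations of nucleotide entries. As a rough illustration of what "translation" means computationally, the sketch below converts a coding DNA sequence into a protein sequence with the standard genetic code. It is a simplified sketch, not the pipeline any database actually uses: it assumes the input starts in the correct reading frame, and the function name and example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":                                # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCAAATGA"))  # MAK
```

Here ATG, GCC, and AAA encode methionine, alanine, and lysine, and TGA ends the translation.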
Some Examples of Secondary Databases:
1. Motif Databases:
- Secondary databases that focus on
identifying and classifying short, recurring patterns of amino acids
(motifs) within protein sequences.
- These motifs are often associated with
specific protein functions or structural features.
- Motif databases offer valuable tools for
predicting protein function, especially for uncharacterized proteins.
Examples of Motif Databases:
- PROSITE:
- Provides curated documentation entries
describing protein domains, families, and functional sites.
- Includes associated patterns and profiles
(signature sequences) to help identify these motifs in new protein
sequences.
- PRINTS:
- Focuses on protein fingerprints, which
are groups of conserved motifs that uniquely characterize a protein
family.
- By identifying these fingerprint motifs
in a new protein sequence, researchers can gain insights into the
protein's family and potential function.
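PROSITE patterns are written in a compact syntax that maps closely onto regular expressions, which makes motif scanning easy to sketch. The example below is a simplified converter covering the common elements (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` or `>` as sequence-end anchors outside brackets). The function names and the test sequence are hypothetical; the pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")    # terminus anchors
        element = element.replace("{", "[^").replace("}", "]")   # forbidden residues
        element = element.replace("(", "{").replace(")", "}")    # repetition counts
        element = element.replace("x", ".")                      # any residue
        regex_parts.append(element)
    return "".join(regex_parts)

def find_motif(pattern, protein):
    """Return (start index, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(protein)]

print(prosite_to_regex("N-{P}-[ST]-{P}"))           # N[^P][ST][^P]
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTL"))  # [(2, 'NGSA')]
```

The second asparagine in the test sequence is followed by proline, which the pattern forbids, so only one site is reported. Profile-based and fingerprint-based methods such as those in PRINTS need more than a regular expression and are not covered by this sketch.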
2. Domain Databases:
- Secondary databases that focus on
identifying and classifying protein domains within protein sequences.
- Protein domains are independently folded,
structurally stable units often associated with specific functions or
evolutionary relationships.
Examples of Domain Databases:
- ProDom:
- ProDom is an automatically generated
protein domain database derived from the Swiss-Prot and TrEMBL sequence
databases.
- It identifies domains based on sequence
similarity searches.
- SMART:
- SMART (Simple Modular Architecture
Research Tool) is a highly reliable and sensitive tool for domain
identification.
- It uses a combination of hidden Markov
models (HMMs) and profile searches to identify domains in protein
sequences.
- COG (Clusters of Orthologous Groups):
- While COG can identify domains and motifs, it is primarily a
functional classification tool.
- It groups proteins with likely
orthologous relationships (genes from different species that evolved from
a common ancestor) across various organisms.
- By identifying shared domains or motifs
within COG groups, researchers can infer functional similarities.
3. 3D structure databases:
Primary vs. Secondary Databases:
- PDB (Protein Data Bank): This is a primary database.
It stores the raw experimental data (3D atomic coordinates) for protein
and nucleic acid structures determined by X-ray crystallography, NMR, and
other methods.
- SCOP (Structural Classification of
Proteins) & CATH (Class, Architecture, Topology, Homology): These are both secondary databases.
They don't store the raw structural data but use the information from PDB
to classify protein structures based on different hierarchical schemes.
Here's a breakdown of each database:
- PDB:
- Primary database for 3D macromolecular
structures.
- Offers raw experimental data (atomic
coordinates) and related information.
- SCOP:
- Secondary database for protein structure
classification.
- Classifies structures based on hierarchy
of structural and evolutionary relationships (folds, superfamilies,
families).
- Uses PDB data for classification.
- CATH:
- Secondary database for protein domain
structure classification.
- Classifies structures based on a
hierarchy of Class, Architecture, Topology, and Homology.
- Uses PDB data for classification.
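The atomic coordinates that PDB stores are traditionally distributed in a fixed-column text format, where each `ATOM` line carries the atom name, residue, chain, and x, y, z position. The sketch below reads two such lines by column position and computes the distance between the atoms. The two lines and their coordinate values are made up for illustration and are not taken from a real entry; only a subset of the columns is parsed.

```python
import math

# Hypothetical ATOM lines in the legacy PDB fixed-column layout.
PDB_LINES = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
]

def parse_atom_line(line):
    """Slice the fixed-width columns of an ATOM record into a dictionary."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in the file's units."""
    return math.dist(
        (atom_a["x"], atom_a["y"], atom_a["z"]),
        (atom_b["x"], atom_b["y"], atom_b["z"]),
    )

atoms = [parse_atom_line(line) for line in PDB_LINES]
print(atoms[0]["atom_name"], atoms[1]["atom_name"], round(distance(*atoms), 2))
```

Classification databases such as SCOP and CATH start from coordinates like these, but their classification logic is far beyond this sketch.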
4. Gene Expression Databases:
- These databases store and organize data on gene expression levels in various biological samples.
- Focus: Understanding how genes are regulated and their roles in different cell types, tissues, and conditions.
- GEO (Gene Expression Omnibus):
- A curated online repository for gene
expression data from various high-throughput functional genomics
experiments, including microarrays and sequencing data.
- Users can browse, query, and retrieve
data for specific genes or experiments.
- GXD (Gene Expression Database):
- A resource specifically focused on gene
expression information in the laboratory mouse.
- It integrates data from various sources
and provides annotations to aid in understanding gene function and
regulation during development and other biological processes.
- MGED (Microarray Gene Expression Data
Society):
- While MGED itself is not a database, it
functioned as a community resource that established standards and
guidelines for data management and exchange related to microarray
experiments.
- This standardization has facilitated the
creation and use of gene expression databases like GEO.
- ArrayExpress (European Bioinformatics
Institute):
- A repository for transcriptomics data,
which refers to the study of an organism's entire RNA profile (including
mRNA, tRNA, rRNA, etc.).
- ArrayExpress accepts data from various
high-throughput sequencing technologies used to analyze gene expression.
5. Metabolic pathway databases
- KEGG PATHWAY: A comprehensive database with graphical pathway maps for known metabolic pathways across various organisms.
- EcoCyc: An E. coli-specific metabolic pathway database that focuses on the genome, biochemical pathways, and enzymes of E. coli.
- LIGAND: A component of KEGG, maintained at the Institute for Chemical Research, Kyoto, that stores information on compounds, drugs, reactions, and enzymes. It is a composite database currently consisting of the COMPOUND, DRUG, GLYCAN, REACTION, RPAIR and ENZYME databases.
- MetaCyc: A well-curated, non-redundant database of experimentally elucidated metabolic pathways from all domains of life.
- BRENDA: An enzyme database that provides comprehensive information on all aspects of enzymes and enzymatic reactions, including function, regulation, kinetics, and 3D structures.
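A metabolic pathway is naturally represented as a graph in which compounds are nodes and enzyme-catalysed reactions are edges. The sketch below stores the first three steps of glycolysis as (substrate, enzyme, product) triples and searches for the chain of enzymes linking two compounds. The data layout and function name are assumptions made for illustration and do not reflect how KEGG, MetaCyc, or any other database stores its pathways internally.

```python
from collections import deque

# (substrate, enzyme, product) triples for the first steps of glycolysis.
REACTIONS = [
    ("glucose", "hexokinase", "glucose-6-phosphate"),
    ("glucose-6-phosphate", "glucose-6-phosphate isomerase", "fructose-6-phosphate"),
    ("fructose-6-phosphate", "phosphofructokinase", "fructose-1,6-bisphosphate"),
]

def find_route(reactions, start, goal):
    """Breadth-first search for a chain of enzymes converting start into goal."""
    outgoing = {}
    for substrate, enzyme, product in reactions:
        outgoing.setdefault(substrate, []).append((enzyme, product))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        compound, enzymes = queue.popleft()
        if compound == goal:
            return enzymes
        for enzyme, product in outgoing.get(compound, []):
            if product not in visited:
                visited.add(product)
                queue.append((product, enzymes + [enzyme]))
    return None  # no route found in this direction

print(find_route(REACTIONS, "glucose", "fructose-1,6-bisphosphate"))
```

The edges here are one-directional for simplicity, so a search from product back to substrate returns `None`.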
6. Genome Databases:
- Genome databases store and organize information about the complete DNA sequence (genome) of various organisms.
- Benefits: Offer insights into genes, their functions, and an organism's heritable traits.
- Links to Organism Databases: Some genome databases provide links to specific organism databases that offer more detailed information about genes, proteins, and phenotypes within that organism.
Examples of Genome Databases:
- GOLD (Genomes Online Database):
- A comprehensive listing of complete and
ongoing genome sequencing projects worldwide.
- Provides information on the organism,
sequencing method, and links to relevant data resources.
- Genomes at NCBI (National Center for
Biotechnology Information):
- A resource within the NCBI website that
offers access to complete genome sequences and related annotations for
various organisms.
- Users can search for specific genomes,
download data, and explore gene features.
7. Virological databases:
- A virological database contains sequences and related information for viruses of animals, plants, bacteria, fungi and archaea; one example is the HIV protease database.
- The International Committee on Taxonomy of Viruses (ICTV) authorises and organises the taxonomic classification of viruses. ICTVdB contains taxonomic information for thousands of virus species.
- Other examples include ViPR (Virus Pathogen Resource), the Virus-Host Database, and NCBI Viral Genomes.
8. World biodiversity databases:
- Taxonomic databases are essential tools
for documenting and classifying all known species.
- They act as a global repository of
information on biodiversity, providing details like:
- Taxonomic hierarchies (classification of
organisms into categories like kingdom, phylum, class, etc.)
- Species names (scientific and potentially
common names)
- Synonyms (alternative names for a
species)
- Descriptions (morphological and other
characteristics)
- Illustrations (images or diagrams of the
species)
- References (links to scientific
literature about the species)
World Biodiversity Databases:
- CCINFO, STRAIN, and ALGAE are relevant resources, but they are specialized
databases rather than world biodiversity databases in the broadest sense:
- CCINFO: A directory of culture collections worldwide, maintained by the
World Data Centre for Microorganisms (WDCM).
- STRAIN: A WDCM database of the microbial strains held by culture collections.
- ALGAE: A database of information on algae species.
Examples of World Biodiversity Databases:
- Here are some examples of databases that
encompass a broader range of biodiversity:
- Global Biodiversity Information Facility
(GBIF): A global network
and data infrastructure that provides open access to information on all
types of life on Earth.
- SpeciesLink: A portal providing access to a network
of species databases across various taxonomic groups.
- Catalogue of Life: An integrated online species catalogue
that aims to list all known species of animals, plants, fungi, and
microorganisms.
- Encyclopedia of Life (EOL): A species-centric online resource that aggregates information from various biodiversity databases and provides species pages with rich content.
Model Organism Databases:
- Specialized databases dedicated to
providing comprehensive information about a particular species widely used
for biological research.
- These databases act as central
repositories for data on:
- Genome sequence and annotations
- Genes and their functions
- Proteins and their structures
- Experimental data (phenotypes, mutations,
gene expression)
- Literature references and other relevant
resources
Benefits:
- Facilitate research on biological
processes, gene function, and development using these model organisms.
- Promote data sharing and collaboration
among researchers worldwide.
Examples of Model Organism Databases:
- Escherichia coli: E. coli Genome Centre, The E. coli Index
- Arabidopsis thaliana: TAIR (The Arabidopsis Information
Resource)
- Homo sapiens: Human Genome Resources at NCBI
- Oryza sativa (rice): RGP (Rice Genome Research Programme)
- Drosophila melanogaster: FlyBase (Drosophila Genome Database)
- Mus musculus (mouse): Mouse Genome Informatics
- Danio rerio (zebrafish): ZFIN (Zebrafish Information Network)
- Saccharomyces cerevisiae (baker's yeast): SGD (Saccharomyces Genome Database)
Annotation of Gene:
Genes and Genomes:
- Genomes are the
complete genetic makeup of an organism, containing both coding (genes) and
non-coding DNA regions.
- Coding regions, or genes, are the parts of
the genome that contain instructions for building proteins, which are
essential for life processes.
Gene Annotation:
- Gene
annotation is the process of adding meaning and interpretation to raw DNA
sequences, particularly those encoding genes.
- It involves analyzing the sequence and
predicting its potential functions.
- This annotation process adds valuable
information to databases, making them more informative and useful for
researchers.
Benefits of Gene Annotation:
- Helps identify genes and their locations
within the genome.
- Predicts the function of genes based on
sequence similarity and other features.
- Provides insights into gene regulation and
expression.
- Enables researchers to understand how
genes contribute to biological processes and diseases.
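One of the simplest steps in locating candidate genes is scanning a sequence for open reading frames (ORFs): stretches that begin with a start codon and run in-frame to a stop codon. The sketch below does this for the forward strand only. It is a toy illustration, not a real gene finder: actual annotation pipelines also consider the reverse strand, introns, sequence similarity, and statistical models. The function name, the `min_codons` parameter, and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    `start` and `end` are 0-based, end-exclusive indices; the stop codon is
    included in the reported span and in the codon count.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                      # look for the next ORF
    return orfs

sequence = "CCATGGCCAAATGACC"
for start, end, frame in find_orfs(sequence):
    print(start, end, frame, sequence[start:end])  # 2 14 2 ATGGCCAAATGA
```

The reported span can then be translated and compared against known proteins to suggest a function.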
Accession Numbers:
- Unique Identifiers: Accession numbers are unique codes
assigned to biological sequences (DNA, RNA, protein) deposited in
databases.
- Permanent Identification: Once assigned, an accession number
remains permanently linked to the specific sequence, allowing for clear
and consistent reference across different resources.
- Fast Assignment: Databases typically assign accession
numbers within a short timeframe (often a couple of days) after receiving
a sequence submission.
Additional Information:
- Accession numbers often follow a specific
format that varies depending on the database.
- For example, GenBank nucleotide accession numbers
typically consist of one letter followed by five digits (e.g., U12345)
or two letters followed by six digits (e.g., AF123456).
- Accession numbers are crucial for
researchers to:
- Find and retrieve specific sequences from
databases.
- Cite sequences accurately in scientific
publications.
- Track changes or updates to a sequence in
the database (some databases use version numbers alongside accession
numbers).
Overall, accession
numbers are a fundamental component of bioinformatics databases, ensuring clear
identification and retrieval of biological sequences.
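Because accession numbers follow fixed formats, they are easy to check and split programmatically. The sketch below validates an identifier against two common GenBank nucleotide formats (one letter plus five digits, or two letters plus six digits) and separates an optional version suffix such as `.2`. The pattern is an assumption that covers only these two formats; other formats exist, and the function name and the example identifier AF123456.2 are made up for illustration.

```python
import re

# Assumed formats: 1 letter + 5 digits, or 2 letters + 6 digits,
# optionally followed by ".<version>". Other accession formats are not covered.
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")

def split_accession(identifier):
    """Return (accession, version) or None if the format is not recognised."""
    match = ACCESSION_RE.match(identifier.strip().upper())
    if match is None:
        return None
    accession, version = match.groups()
    return accession, int(version) if version else None

print(split_accession("U12345"))      # ('U12345', None)
print(split_accession("AF123456.2"))  # ('AF123456', 2)
print(split_accession("12345"))       # None
```

Keeping the version separate is useful because the accession stays fixed while the version number increases when the sequence is updated.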
Full abbreviations, their origin, and a brief description of their use or purpose:
1. GenBank:
- Origin: National Center for Biotechnology Information (NCBI), USA.
- Purpose: A database of nucleotide sequences, including DNA sequences from various organisms, with annotations and metadata.
2. DDBJ:
- Origin: DNA Data Bank of Japan.
- Purpose: One of the three major nucleotide sequence databases, alongside GenBank and EMBL-Bank, for archiving and sharing DNA sequence data.
3. EMBL:
- Origin: European Molecular Biology Laboratory (EMBL).
- Purpose: EMBL-Bank is a nucleotide sequence database that stores DNA sequences submitted by researchers worldwide.
4. Swiss-Prot:
- Origin: Swiss Institute of Bioinformatics (SIB).
- Purpose: Swiss-Prot is a manually curated protein sequence database providing comprehensive information on protein function, structure, and interactions.
5. PIR:
- Origin: Protein Information Resource.
- Purpose: A comprehensive protein sequence database containing annotated and classified protein sequences, used for protein sequence analysis and annotation.
6. GenPept:
- Origin: National Center for Biotechnology Information (NCBI), USA.
- Purpose: A protein sequence database containing translations of all coding sequences (CDS) from GenBank, used for protein sequence analysis and annotation.
7. PDB:
- Origin: Protein Data Bank.
- Purpose: A repository of experimentally determined three-dimensional structures of proteins, nucleic acids, and complexes, used for structural biology and drug discovery.
8. EMDB:
- Origin: European Bioinformatics Institute (EBI).
- Purpose: Electron Microscopy Data Bank, associated with the Macromolecular Structure Database (MSD); stores 3D EM maps and models, primarily for macromolecular complexes.
9. MMDB:
- Origin: National Center for Biotechnology Information (NCBI), USA.
- Purpose: Molecular Modeling Database, containing 3D structures of macromolecules derived from the Protein Data Bank (PDB), used for structure visualization and analysis.
10. PROSITE:
- Origin: Swiss Institute of Bioinformatics (SIB).
- Purpose: A database of protein families and domains, containing patterns, profiles, and motifs characteristic of different protein families, used for sequence analysis and annotation.
11. BLOCKS:
- Origin: Blocks Database Group.
- Purpose: A database of conserved protein sequence motifs, or "blocks," derived from multiple sequence alignments, used for protein family classification and motif discovery.
12. COG:
- Origin: Clusters of Orthologous Groups.
- Purpose: A database of orthologous protein groups across multiple organisms, used for evolutionary and functional analysis of genes and proteins.
13. SCOP:
- Origin: Structural Classification of Proteins.
- Purpose: A database of protein structural domains, organized into a hierarchical classification scheme based on structural and evolutionary relationships.
14. CATH:
- Origin: Class, Architecture, Topology, Homologous superfamily.
- Purpose: A database of protein domain structures, classified into hierarchical levels based on their structural and functional characteristics.
15. GEO:
- Origin: National Center for Biotechnology Information (NCBI), USA.
- Purpose: Gene Expression Omnibus, a public repository for high-throughput gene expression data, including microarray and RNA-seq experiments.
16. GXD:
- Origin: Mouse Genome Informatics (MGI).
- Purpose: Gene Expression Database, focusing on gene expression patterns during mouse development, providing spatial and temporal expression data.
17. MGED:
- Origin: Microarray Gene Expression Data (MGED) Society.
- Purpose: An organization promoting standards and best practices for microarray experiments and gene expression data analysis.
18. KEGG:
- Origin: Kyoto Encyclopedia of Genes and Genomes.
- Purpose: A database of biological pathways, diseases, drugs, and organisms, providing integrated information for systems biology and metabolic pathway analysis.
- Origin: National Center for Biotechnology Information (NCBI), USA.
- Purpose: A database of biochemical pathways and networks, integrating information from various sources for pathway analysis and visualization.
20. EMP:
- Origin: Environmental Microbial Profiling.
- Purpose: A database of microbial communities and their functional profiles, derived from environmental sequencing data, used for studying microbial ecology and biogeochemistry.
21. TGI:
- Origin: The Gene Index Project.
- Purpose: A collection of gene indices, representing transcript sequences from various organisms, used for gene discovery and expression analysis.
22. GSDB:
- Origin: Genome Sequence DataBase.
- Purpose: A database of complete genome sequences and annotations, providing resources for comparative genomics and evolutionary studies.
23. GPCRDB:
- Origin: G Protein-Coupled Receptor Database.
- Purpose: A database of G protein-coupled receptors (GPCRs), including sequence information, structural data, and ligand interactions, used for drug discovery and pharmacology research.