    bioinformatics(4)                     


    From: happymood (土豆块儿), Board: Bioinformatics
    Subject: bioinformatics(4)
    Posted at: Peking University Weiming BBS (Tuesday, April 10, 2001, 16:15:35), local mail

    From: spaceman (estranged), Board: LifeScience
    Subject: Bioinformatics, Genomics, and Proteomics (forwarded)
    Posted at: SMTH BBS (Shuimu Tsinghua) (Wed Nov 29 10:34:37 2000)

    The Scientist 14[23]:26, Nov. 27, 2000
    PROFILE
    Bioinformatics, Genomics, and Proteomics
    Scientific discovery advances as technology paves the path
    By Christopher M. Smith
    Data Mining Software for Genomics, Proteomics and Expression Data (Part 1)
    Data Mining Software for Genomics, Proteomics and Expression Data (Part 2)
    High-throughput (HT) sequencing, microarray screening and protein expression profiling technologies drive discovery efforts in today's genomics and proteomics laboratories. These tools allow researchers to generate massive amounts of data, at a rate orders of magnitude greater than scientists ever anticipated. Initiatives to sequence entire genomes have resulted in single data sets ranging in size from 1.8 million nucleotides (Haemophilus influenzae genome) to more than 3 billion (human genome)--a single microarray assay can easily produce information on thousands of genes, and a temporal protein expression profile may capture a data picture of 6,000 proteins.1
    [Figure: Integration of Genomica's LinkMapper with ABI's Gene Mapper]
    It's what you do with the data that counts, however, and that's where bioinformatics takes over. Researchers in bioinformatics are dedicated to the development of applications that can store, compare, and analyze the voluminous quantities of data generated by the use of new technologies.
    One of the original functions of bioinformatics was to provide a mechanism to compare a query DNA or protein sequence against all sequences in a database. Several comparison algorithms have provided some successful and powerful computational applications,2 such as Smith-Waterman, FASTA, and BLAST. Early on, query sequences or sets of query sequences were relatively small, ranging from a few to 10,000 nucleotides, and 10- to 1,000-sequence query sets. Because of the proliferation and improvement of HT sequencing technologies, it is now common to find query sequences with 10,000 nucleotides and data sets containing up to 1 million sequences.
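    The comparison algorithms named above all rest on sequence alignment. As a rough illustration (not part of the original article), the sketch below implements the core dynamic-programming recurrence of Smith-Waterman local alignment in Python; the match/mismatch/gap scores are arbitrary placeholders, and real tools add substitution matrices, affine gap penalties, and heavy optimization.

        # Minimal Smith-Waterman local alignment (score only); scoring values are illustrative.
        def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
            """Return the best local alignment score between sequences a and b."""
            rows, cols = len(a) + 1, len(b) + 1
            # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1].
            H = [[0] * cols for _ in range(rows)]
            best = 0
            for i in range(1, rows):
                for j in range(1, cols):
                    diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                    # The 0 option lets a poor alignment restart from scratch.
                    H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                    best = max(best, H[i][j])
            return best

        if __name__ == "__main__":
            print(smith_waterman("ACACACTA", "AGCACACA"))  # best local alignment score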
    The kinds of data developed and the methods for processing and analysis also have changed. Previously, small-scale DNA sequencing projects would perhaps generate 100 sequences (usually 50-400 nucleotides) that could be assembled relatively easily into a contiguous DNA sequence (a contig). Today, contig assembly may involve 1 million sequences with up to 5,000 nucleotides. The burgeoning fields of proteomics and microarray technologies provide another degree of complexity, adding multidimensional information to the biological data cornucopia.
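    To make the idea of contig assembly concrete, here is a toy greedy assembler (my illustration, not from the article): it repeatedly merges the pair of reads with the longest exact suffix/prefix overlap. The reads below are made up and error-free; real assemblers must cope with sequencing errors, repeats, and millions of reads, so they use far more sophisticated graph-based methods.

        # Toy greedy assembly: repeatedly merge the two reads with the longest
        # exact suffix/prefix overlap until no overlap of MIN_OVERLAP or more remains.
        MIN_OVERLAP = 3

        def overlap(a, b):
            """Length of the longest suffix of a that equals a prefix of b."""
            for n in range(min(len(a), len(b)), MIN_OVERLAP - 1, -1):
                if a.endswith(b[:n]):
                    return n
            return 0

        def assemble(reads):
            reads = list(reads)
            while len(reads) > 1:
                best_n, best_i, best_j = 0, None, None
                for i, a in enumerate(reads):
                    for j, b in enumerate(reads):
                        if i != j and overlap(a, b) > best_n:
                            best_n, best_i, best_j = overlap(a, b), i, j
                if best_n == 0:   # nothing overlaps well enough; stop merging
                    break
                merged = reads[best_i] + reads[best_j][best_n:]
                reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
            return reads

        if __name__ == "__main__":
            # Hypothetical reads drawn from one underlying sequence.
            print(assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]))
            # -> ['ATTAGACCTGCCGGAATAC']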
    New Scientific Challenges
    The exponential rate of discovery in the era of modern molecular biology has been nothing short of phenomenal, culminating with the announcement in June 2000 that preliminary sequencing of the human genome had been completed.3 However, this achievement is just a taste of the scientific successes that are to come in the 21st century. As impressive as it is, the determination of the sequence of the approximately 3.2 billion nucleotides of the human genome, encoding an estimated 100,000 proteins, represents only the first step down a long road. Gene identification does not automatically translate into an understanding of gene function. Although mapping and cloning studies have linked a number of genes to heritable genetic diseases, the true (i.e., "normal") function of a majority of these genes remains unknown.
    This dichotomy between gene identity and function will be one source of new research challenges in the 21st century, encompassing problems in biological science, computational biology, and computer science. Biologists will need to decipher the genetic makeup of genomes, map genotypes with phenotypic traits, determine gene and protein structure and function, design and develop therapeutic agents (recombinant and genetically engineered proteins, and small molecule ligands), and unravel biochemical pathways and cellular physiology. Tackling these biological issues will require innovations in computational biology that will be met by the development of new algorithms and methods for comparison of DNA and protein sequence, design of novel metrics for similarity and homology analyses, tools to outline biochemical pathways and interactions, and construction of physiological models. Success in the computational biology arena will require improvements in computational and informatics infrastructure, including development of novel databases as well as annotation, curation, and dissemination tools for the databases; design of parallel computation methods; and development of supercomputers. These latter challenges are particularly important, as high performance computing (HPC) and bioinformatics applications need to be retooled to accommodate the fast interrogation of a plethora of databases, comparisons between relatively long strings of data, and data with varying degrees of complexity and annotation.
    The lion's share of interest and effort over the past few years has been directed toward protein identification (proteomics), structure-function characterization (structural bioinformatics), and bioinformatics database mining. The pharmaceutical industry has for the most part driven these efforts in the search for new therapeutic agents. Identifying proteins from the cellular pool and/or determining structure-function in the absence of concrete biological data is a daunting task, but novel technological approaches are helping scientists to make headway on these fronts.
    Proteomics: Protein Expression Profiling
    Proteomics refers to the science and the process of analyzing and cataloging all the proteins encoded by a genome (a proteome). Since the majority of all known and predicted proteins have no known cellular function, the hope is that proteomics will bridge the chasm separating what raw DNA and protein primary sequence reveals about a protein and its cellular function. Determining protein function on a genomewide scale can provide critical pieces to the metabolic puzzle of cells. Because proteins are involved in one measure or another in disease states (whether induced by bacterial or viral infection, stress, or genetic anomaly), complete descriptions of proteins, including sequence, structure, and function, will substantially aid the current pharmaceutical approach to therapeutics development. This process, known as rational drug design, involves the use of specific structural and functional aspects of a protein to design better proteins or small molecule ligands that can serve as activators or inhibitors of protein function. A recent technology profile in LabConsumer4 and a meeting review5 detail companies providing proteomics tools.
    The multidimensional nature of proteomics data (for example, 2D-PAGE gel images) presents novel collection, normalization, and analysis challenges. Data collection issues are being overcome by sophisticated proteomic systems that semiautomate and integrate the experimental process with data collection. Improvements in the experimental technology have increased the number of proteins that can be identified, with consistency, within a single gel; however, making comparisons and looking for patterns and relationships between proteins and/or particular environmental, disease, or developmental states requires data mining and knowledge discovery tools.
    Finding the Needle in the Haystack
    Data mining refers to a new genre of bioinformatics tools used to sift through the mass of raw data, finding and extracting relevant information and developing relationships between them.6 As advances in instrumentation and experimental techniques have led to the accumulation of massive amounts of data, data mining applications are providing the tools to harvest the fruit of these labors. Maximally useful data mining applications should:
    * process data from disparate experimental techniques and technologies and data that has both temporal (time studies) and spatial (organism, organ, cell type, sub-cellular location) dimensions;
    * be capable of identifying and interpreting outlying data;
    * use data analysis in an iterative process, applying gained knowledge to constantly examine and reexamine data; and
    * use novel comparison techniques that extend beyond the standard Bayesian (similarity search) methods.
    Data mining applications are built on complex algorithms that derive explanatory and predictive models from large sets of complex data by identifying patterns in data and developing probable relationships. Data mining workbenches also incorporate mechanisms to filter, standardize/normalize, cluster data, and visualize results.
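    One of those workbench steps, clustering, can be sketched in a few lines. The example below is an illustration, not part of the article: the expression matrix is invented, and SciPy's hierarchical clustering is assumed as the tool. It groups genes whose expression patterns correlate across conditions, which is the kind of operation an expression data mining workbench performs at much larger scale.

        # Hierarchical clustering of toy gene expression profiles (genes x conditions).
        # The matrix is made-up data; real studies cluster thousands of genes.
        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster

        expression = np.array([
            [0.1, 0.2, 5.1, 5.0],   # geneA: induced in conditions 3 and 4
            [0.0, 0.3, 4.8, 5.2],   # geneB: profile similar to geneA
            [4.9, 5.1, 0.2, 0.1],   # geneC: opposite pattern
            [5.0, 4.7, 0.0, 0.2],   # geneD: profile similar to geneC
        ])
        genes = ["geneA", "geneB", "geneC", "geneD"]

        # Average-linkage clustering on correlation distance between profiles.
        tree = linkage(expression, method="average", metric="correlation")
        labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters

        for gene, label in zip(genes, labels):
            print(gene, "-> cluster", int(label))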
    As a tool to identify open reading frames (ORFs) or hypothetical genes in genomic data, data mining is a new twist on existing gene discovery applications, such as programs that identify intron/exon boundaries in genomic DNA. One of data mining's greatest practical applications will be in the area of HT, microarray-based gene- and protein-expression profiling, where massive data sets need to be examined to identify sometimes subtle intrinsic patterns and relationships. Differential gene analysis has the potential to explicitly describe the interrelationships of genes during development, under physiological stress, and during pathogenesis. The data mining approach taken to analyze microarray data is a function of experimental design and purpose. Investigations analyzing defined perturbations of a given genetic stasis use hypothesis-testing computational methods, whereas genetic surveys and research into fundamental cellular biology use statistical methods. Similarly, the same methods are utilized in analyzing large-scale proteomics data sets.
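    The ORF-finding task mentioned at the start of that paragraph reduces, in its simplest form, to scanning reading frames for a start codon followed in-frame by a stop codon. The sketch below is only an illustration: the sequence and the minimum-length cutoff are arbitrary, it covers only the forward strand of unspliced DNA, and real gene finders also scan the reverse complement and apply statistical gene models.

        # Minimal ORF scan: report ATG...stop open reading frames on the forward
        # strand, in all three reading frames. Coordinates are 0-based, end-exclusive.
        STOP_CODONS = {"TAA", "TAG", "TGA"}

        def find_orfs(seq, min_codons=3):
            seq = seq.upper()
            orfs = []
            for frame in range(3):
                start = None
                for i in range(frame, len(seq) - 2, 3):
                    codon = seq[i:i + 3]
                    if codon == "ATG" and start is None:
                        start = i
                    elif codon in STOP_CODONS and start is not None:
                        if (i + 3 - start) // 3 >= min_codons:
                            orfs.append((start, i + 3, seq[start:i + 3]))
                        start = None
            return orfs

        if __name__ == "__main__":
            # Hypothetical fragment containing one short ORF (ATG AAA TTT TAA).
            for begin, end, orf in find_orfs("CCATGAAATTTTAAGG"):
                print(begin, end, orf)   # -> 2 14 ATGAAATTTTAA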
    An extension of data mining is the concept of knowledge discovery (KD), in which the results of data mining experiments open up new avenues of research,7 with obvious and subtle findings forming the basis of new questions from different perspectives. Some of the more prominent data mining applications and KD workbenches are described in the accompanying table.
    Predicting Protein Structure and Function
    Structural bioinformatics involves the process of determining a protein's three-dimensional structure using comparative primary sequence alignment, secondary and tertiary structure prediction methods, homology modeling, and crystallographic diffraction pattern analyses. Currently, there is no reliable de novo predictive method for protein 3D-structure determination. Over the past half-century, protein structure has been determined by purifying a protein, crystallizing it, then bombarding it with X-rays. The X-ray diffraction pattern from the bombardment is recorded electronically and analyzed using software that creates a rough draft of the 3D structure. Biological scientists and crystallographers then tweak and manipulate the rough draft considerably. The resulting spatial coordinate file can be examined using modeling-structure software to study the gross and subtle features of the protein's structure.
    One major bottleneck associated with this classic crystallography technology is the inordinate amount of time it takes to successfully grow protein crystals. This problem is being addressed by HT technology under development that streamlines the crystallization process. This HT crystallography technology performs many crystallization conditions in parallel with real-time photo-video crystal monitoring. This enables the researcher to test thousands of crystallization conditions simultaneously, aborting those conditions that do not work at an early stage and selecting "perfect" crystals suitable for X-ray analysis.
    Efforts to bypass the excessive time needed to tweak the rough draft of X-ray crystallographic structures have led to the advancement of computational modeling (homology and ab initio modeling) approaches. These techniques have been under development, in one form or another, since the first protein structure (of myoglobin) was determined in the late 1950s.8 Computational modeling utilizes predictive and comparative methods to fashion a new protein structure. Ab initio methods use the physicochemical properties of the amino acid sequence of a protein to literally calculate a 3D structure (lowest energy model) based on protein folding. As opposed to determining the structure of an entire protein, ab initio methods are typically used to predict and model protein folds (domains). This method is gaining considerably, in part due to the development of novel mathematical approaches, a boost in available computational resources (for example, tera- and petaFLOPS supercomputers), and considerable interest from researchers investigating protein-ligand (or drug) interactions. Having the structure, even if only hypothetical, for a part of the protein that interacts with a ligand can potentially hasten drug exploration research.
    In homology modeling, the structural and functional characteristics of known proteins are used as a template to create a hypothesized structure for an "unknown" protein with similar functional and structural features. Protein structure researchers estimate that 10,000 protein structures will provide enough data to define most, if not all, of the approximately 1,000 to 5,000 different folds that a protein can assume;9 hence, predictive structure modeling will become more accurate and important as more and more structures are derived. The homology modeling approach has become very important to the pharmaceutical industry, where expense and time are major drawbacks to the classical methods of determining protein structure, even if automation shortens the discovery cycle. Hypothesized models provide an electronic footprint with which researchers may computationally design various "shoes," such as inhibitors, activators, and ligands.10 This provides for better engineering of potential drugs and reduces the number of compounds that need to be tested in vitro and in vivo.
    A variety of companies and research initiatives have undertaken these modern approaches to 3D protein structure determination. Most produce structure prediction/modeling applications useful in drug development and basic science research, provide access to proprietary structure databases, and/or will develop customized analysis services for researchers. LabConsumer will present a profile on molecular modeling applications, including those that are key players in homology modeling, early next year.
    Tools for the 21st Century
    Modern experimental technologies are providing seemingly endless opportunities to generate massive amounts of sequence, expression, and functional data. The drive to capitalize on this enormous pool of information in order to understand fundamental biological phenomena and develop novel therapeutics is pushing the development of new computational tools to capture, organize, categorize, analyze, mine, retrieve, and share data and results. Most current computational applications will suffice for analyses of specific questions using relatively small data sets. But to expand scientific horizons, to accommodate the larger and larger data sets, and to find patterns and see relationships that span temporal and spatial scales, new tools that broaden the scope and complexity of the analyses are needed. Many of these data mining tools are available from the companies highlighted in the accompanying table. These new products and those listed in a previous LabConsumer profile11 have the capacity to expand research opportunities immeasurably.

    Christopher M. Smith (csmith@sdsc.edu) is a freelance science writer in San Diego.
    References
    1. W.P. Blackstock, M.P. Weir, "Proteomics: quantitative and physical mapping of cellular proteins," Trends in Biotechnology, 17:121-7, 1999.
    2. R.F. Doolittle, "Computer methods for macromolecular sequence analysis," Methods in Enzymology, Vol. 206, San Diego, Academic Press, 1996.
    3. A. Emmett, "The Human Genome," The Scientist, 14[15]:1, July 24, 2000.
    4. L. De Francesco, "One step beyond: Going beyond genomics with proteomics and two-dimensional technology," The Scientist, 13[1]:16, January 4, 1999.
    5. S. Borman, "Proteomics: Taking over where genomics leaves off," Chemical & Engineering News, 78[31]:31-7, July 31, 2000.
    6. J.L. Houle et al., "Database mining in the human genome initiative," www.biodatabases.com/whitepaper.html, Amita Corp., 2000.
    7. G. Zweiger, "Knowledge discovery in gene-expression-microarray data: mining information output of the genome," Trends in Biotechnology, 17:429-36, 1999.
    8. J.C. Kendrew et al., "Structure of myoglobin," Nature, 185:422-7, 1960.
    9. L. Holm, C. Sander, "Mapping the protein universe," Science, 273:595-602, 1996.
    10. J. Skolnick, J.S. Fetrow, "From genes to protein structure and function: Novel applications of computational approaches in the genomics era," Trends in Biotechnology, 18:34-9, 2000.
    11. C. Smith, "Computational gold: Data mining and bioinformatics software for the next millennium," The Scientist, 13[9]:21-3, April 26, 1999.
    12. R.H. Gross, "CMS molecular biology resource," Biotech Software & Internet Journal, 1:5-9, 2000.
    Bioinformatics on the Web
    Portals to data analysis
    The heart of bioinformatics analyses is the software and the databases upon which many of the analyses are based. Traditionally, bioinformatics software has required high-end workstations (desktop to mid-range servers) with a multitude of visualization plug-ins and/or peripheral equipment, and a user (or administrator) willing to routinely download database updates. The mid-range UNIX server is still the standard bioinformatics platform, though there are also a fair number of Microsoft Windows and Apple PowerMac computers. There are also a number of specialized platforms that integrate hardware and custom software into a powerful data analysis tool, such as DeCypher, produced by Incline Village, Nev.'s TimeLogic (http://www.timelogic.com/); Bioccelerator, from Compugen Ltd. of Tel Aviv, Israel (http://www.cgen.com/); and GeneMatcher, manufactured by Paracel Inc. (http://www.paracel.com/) of Pasadena, Calif. Yet the amount of time, money, and effort needed to purchase and maintain the hardware, software, and databases required for bioinformatics research can be a considerable burden to a research laboratory.
    [Figure: 2D-gel analysis with Compugen's Z3OnWeb.com]
    To circumvent many of these problems, a few commercial entities are now providing fee-based bioinformatics analysis services through the World Wide Web. These services offer several advantages over local stand-alone or server-based analyses. Because they are provided through a Web interface, these services are platform-independent and may be accessed by practically any Web browser. Also, they are world accessible. No longer must researchers struggle with different applications (doing the same function), different computer systems, file formats, and other hurdles to access their data and results. Bioinformatics Web portals truly provide universal access. Some of the more recent application service providers of Web-based bioinformatics tools are presented below.
    Bionavigator (http://www.bionavigator.com/) is a product of eBioinformatics Inc., of Sunnyvale, Calif., a spin-off venture of the Australian National Genomic Information Service. This service primarily targets academic researchers and provides access to more than 20 databases and 200 analytical tools, including those for database searching, DNA/protein sequence analysis, phylogenetic analyses, and molecular modeling. Another attractive and useful feature of the Bionavigator is that it can generate publication-quality result output (for example, color-coded multiple sequence alignments and graphic phylogenetic trees).
    Doubletwist.com, formerly Pangea Systems of Oakland, Calif., is a major purveyor of annotated sequence data through its Prophecy database. DoubleTwist has recently added fee-based bioinformatics services through an integrated life science portal. Using any one of a number of "research agents," researchers can analyze protein and DNA sequence data. DNA analysis tools provide for the identification of new gene family members, potential full-length cDNAs, and sequence homologs, whereas the protein tools include routines to identify protein family associations, protein-protein interactions, and conserved protein domains.
    GeneSolutions.com, a product of HySeq Inc., of Sunnyvale, Calif., provides access to information describing proprietary gene sequences and related data from more than 1.4 million expressed sequence tags (EST) analyzed by HySeq using its proprietary SBH process. The GeneSolutions Portfolio contains gene sequences, homology data, and gene expression data generated by HySeq. More than 35,000 genes are reported to have been identified and characterized in HySeq's proprietary databases.
    IncyteGenomics OnLine Research (www.incyte.com/online) provides a Web portal to the numerous databases developed and maintained by Incyte Genomics Inc., of Palo Alto, Calif., and a personal workbench where researchers can store their sequences, perform analyses, and search the company's databases.
    LabOnWeb.com (http://www.labonweb.com/), developed by Compugen Ltd., is an Internet life science research engine providing access to a variety of gene discovery tools. First introduced in December 1999, the latest version (2.0), released in September 2000, includes a variety of tools for the prediction of open reading frames and polypeptides (including an InstantRACE module that uses public and proprietary databases to return a complete cDNA sequence given an input EST), alternative splicing sites, gene function (by similarity to protein domain profiles), and tissue distribution, among others.
    Z3OnWeb.com (http://www.2dgels.com/) is another service provided by Compugen for the analysis of 2D-gel image data using Z3 software. Researchers have the option of purchasing and operating the software from their own workstations, or they may upload their image data to the Web-accessible Z3 platform for analysis.
    For researchers working on a nonexistent bioinformatics budget, there are still a host of powerful bioinformatics applications, accessible without charge, on the Web. If the researcher needs only to perform one or two types of analyses, and if data security, having to work through several disparate applications, and output format are not critical issues, then these gratis Web tools are a bargain.
    A comprehensive listing of more than 2,300 Web-based bioinformatics tools (and information sources), organized according to the type of analyses they perform, is available through the CMS Molecular Biology Resource12 (www.sdsc.edu/restools) at the San Diego Supercomputer Center, University of California. A good place to start is the National Institutes of Health's National Center for Biotechnology Information Web site (http://www.ncbi.nlm.nih.gov/). This server contains sequencing and mapping data for nearly 800 different organisms through the GenBank database, all searchable using the BLAST tool. NCBI also contains an ORF finder, the Online Mendelian Inheritance in Man (OMIM) database of human genes, and a variety of other useful tools, most of them cross-indexed to the NCBI PubMed MEDLINE database.
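    As a concrete way to reach those GenBank BLAST searches from a script, the sketch below uses Biopython's NCBIWWW interface. Biopython is not mentioned in the article and is assumed here only for illustration; the query sequence is a made-up placeholder, and the call needs network access to the NCBI servers.

        # Submitting a BLAST query to NCBI programmatically (Biopython is assumed
        # here for illustration; it is not part of the article). Needs network access.
        from Bio.Blast import NCBIWWW, NCBIXML

        query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # placeholder nucleotide sequence

        # Run blastn against the nt database and parse the XML that NCBI returns.
        result_handle = NCBIWWW.qblast("blastn", "nt", query)
        record = NCBIXML.read(result_handle)

        for alignment in record.alignments[:5]:      # report the top five hits
            best_hsp = alignment.hsps[0]
            print(alignment.title[:60], "E =", best_hsp.expect)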
    --Christopher M. Smith

    --
    ※ Source: Peking University Weiming BBS (bbs.pku.edu.cn) [FROM: 166.111.185.231]

