Why do other chicken EST sets appear to have more known genes in them?
It has been brought to our attention that our collection of chicken ESTs
has a much lower fraction of hits against the sequence databases than
other sets. This certainly seems to be the case. To check, we analysed
a few subsets of the chicken ESTs in the public domain.
Riken set 1 | 7,410 | 57% | 54% | 49%
EMBL EST set 2 | 23,026 | 74% | 71% | 67%
Assembled set 2 | 10,068 | 67% | 63% | 60%
BBSRC ESTs | 330,388 | 50% | 48% | 45%
BBSRC assembly | 85,486 | 39% | 37% | 35%
BBSRC+Genbank contigs (All) | 97,221 | 42% | 39% | 35%
BBSRC+Genbank contigs (BBSRC only) | 73,023 | 35% | 32% | 28%
BBSRC+Genbank contigs (Genbank only) | 8,637 | 43% | 39% | 34%
BBSRC+Genbank mixed contigs | 15,561 | 75% | 72% | 70%
Notes:
Riken set 1 corresponds to the ESTs published by the Buerstedde
group, whilst EMBL EST set 2 includes the Riken set plus further
ESTs deposited by the Buerstedde group that were not originally
listed in the paper. This set was assembled using PHRAP to produce
10,068 contigs (Assembled set 2). The next two rows cover the BBSRC
project ESTs and contigs. Finally, the last four rows refer to contigs
produced by assembling the BBSRC ESTs together with 60,000 chicken
ESTs in Genbank, giving 97,221 contigs. Only 15,561 of these are
shared by both the BBSRC and Genbank ESTs.
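As a quick consistency check on the figures above, the BBSRC-only, Genbank-only and mixed contig counts should add up to the combined assembly total. A minimal sketch in Python; the variable names are chosen here purely for illustration:

```python
# Contig counts copied from the table above.
contigs = {
    "BBSRC only": 73_023,
    "Genbank only": 8_637,
    "mixed": 15_561,
}

# The three categories partition the combined BBSRC+Genbank assembly.
total = sum(contigs.values())

# Fraction of contigs that contain ESTs from both projects.
shared_fraction = contigs["mixed"] / total

print(f"total contigs: {total:,}")               # total contigs: 97,221
print(f"shared by both: {shared_fraction:.1%}")  # shared by both: 16.0%
```

So roughly one contig in six contains ESTs from both sources.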
As can be seen, the BBSRC set has a relatively low percentage of "known"
genes in it compared to other sets. We believe this is for a number
of reasons:
- Our collection of ESTs is much
larger than previous sets, and the sheer size of the collection
makes it less likely that any given sequence is a known gene. If our
contig set contained the same fraction of "known" genes as
"Assembled set 2", then we would have over 55,000 different
genes, a number that is likely to easily exceed the total number in
the chicken genome!
- Our libraries went through
extensive, stringent normalisation procedures which will have
removed many of the abundant (and therefore common) genes. Hence, we
discover more unusual genes, which are less likely to have
matches in the databases.
- Our strategy focussed on
minimising the number of known genes. At several stages during
our sequencing cycle we assessed the "uniqueness" of each
library in comparison with all the others, and then picked clones
from the libraries we judged most informative. Thus we focussed in
particular on tissues and libraries that were not yielding many
well-known genes with characterised homologues.
- Our ESTs do appear to be
unusual/rare clones. Considering the data above, contigs generated
in the BBSRC+Genbank assembly which are shared by both projects have
a very high "hit" rate to the databases (70% to 75%). The contigs
which are less common (i.e. unique to either BBSRC or Genbank) have
much lower hit rates (28% to 43%).
- Our assembly protocol differs
slightly from others used. Although we have used PHRAP, which is
believed to over-assemble contigs, we use a BLAST-based pre-filter
which attempts to keep genuinely dissimilar clones apart.
Nevertheless, this may change the number of contigs produced, and
hence the fraction of contigs with BLAST hits. We note that, in
general, the overall percentage with BLAST hits falls when a set
of ESTs is assembled, because all the redundant clones with the same
BLAST hit are brought together in one contig.
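Two of the points above can be checked with simple arithmetic. The sketch below first reproduces the "over 55,000" figure, assuming it comes from applying the highest hit rate for Assembled set 2 (67%) to the 85,486 BBSRC contigs; the text does not say which of the three percentages was used. It then illustrates why assembly lowers the hit percentage, using a small made-up list of ESTs and the simplifying assumption that ESTs without a hit each remain a separate contig. The function names and the toy data are hypothetical, not part of the project's pipeline.

```python
def extrapolated_known(contigs: int, hit_rate: float) -> int:
    """Number of 'known' contigs if the given hit rate applied."""
    return round(contigs * hit_rate)


def hit_fraction_before_after(est_hits):
    """Return the hit fraction before and after an idealised assembly.

    est_hits holds one entry per EST: the name of its best database
    hit, or None if it has no hit. Assumption: ESTs sharing a hit
    collapse into one contig; ESTs with no hit stay as singletons.
    """
    n_hit = sum(h is not None for h in est_hits)
    before = n_hit / len(est_hits)
    distinct_hits = len({h for h in est_hits if h is not None})
    n_unmatched = len(est_hits) - n_hit
    after = distinct_hits / (distinct_hits + n_unmatched)
    return before, after


# 85,486 BBSRC contigs at the 67% rate seen for Assembled set 2.
print(extrapolated_known(85_486, 0.67))  # 57276

# Toy data: two abundant known genes plus four ESTs with no hit.
toy = ["gene A"] * 5 + ["gene B"] * 3 + [None] * 4
before, after = hit_fraction_before_after(toy)
print(f"before assembly: {before:.0%}, after assembly: {after:.0%}")
```

In the toy case the hit rate drops from 67% of ESTs to 33% of contigs, even though no sequence has changed. Real assemblies will also merge some unmatched ESTs, so the size of the drop here is illustrative only.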