Do the libraries contain a substantial amount of contamination from genomic DNA?

We believe that the level of genomic DNA or incompletely spliced RNA (hnRNA) present in our clones is minimal, and typical of what one would expect to find in any high-quality set of ESTs. We have conducted two simple analyses to demonstrate this.

We first looked in detail at the well-characterised beta actin gene, for which the full-length DNA sequence, including the complete exon-intron structure, is present in the current databanks (EMBL-ID GGAC01, Accession X00182). We find 193 ESTs in our collection that have a stringent match to this gene sequence (%id > 96, length > 50), and have analysed these ESTs to see how many have robust matches to the 4 known introns in the sequence. We find only 2 stringent matches to the introns: EST 603003869F1, which matches intron 3, and EST 603110961, which matches intron 2. Hence, we estimate the intron contamination level to be 2 in 193, or approximately 1%. This probably reflects the fact that splicing is a dynamic process: at the time the RNA was harvested, a small percentage of the nuclear RNA would still have been undergoing splicing.

A more general analysis was then undertaken. All intron-annotated Gallus gallus sequences in EMBL were identified, and a database of exon-bounded introns was extracted from them, giving 878 separate intronic sequences in total. Using BLASTN, we found the subset of our ESTs that had stringent matches to the coding regions of this dataset of intron-containing EMBL cDNAs. This gave 6235 ESTs in total, which were then compared with the 878 exon-bounded introns, again using BLASTN. The purpose of this search was to identify intron matches for which the same EST also matched the corresponding exons. The BLASTN results were parsed to retain only high-quality hits (greater than 98% ID and at least 30 bases long) where the same EST matched both the exon and an intron from a given cDNA with high stringency. There were 106 such instances, but these involved only 77 different ESTs, since some ESTs have several intron hits (602955228F1, for example, has 7).
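The filtering step described above can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes the BLASTN results are available as tab-separated lines whose first four columns are query ID, subject ID, percent identity and alignment length, and it assumes a hypothetical naming scheme in which intron sequences are labelled "<accession>_intron<N>". The function names and the example IDs are invented for the sketch.

```python
from collections import defaultdict

MIN_IDENTITY = 98.0  # hits must exceed this percent identity
MIN_LENGTH = 30      # hits must be at least this many bases long


def parse_hits(lines):
    """Yield (est_id, subject_id) for hits that pass the stringency filter.

    Assumed layout (tab-separated): query, subject, percent identity,
    alignment length, then any further columns, which are ignored.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        est_id, subject_id = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if identity > MIN_IDENTITY and length >= MIN_LENGTH:
            yield est_id, subject_id


def confirmed_intron_hits(exon_lines, intron_lines):
    """Return {est_id: set of intron IDs} for ESTs that match both an intron
    and the exons of the same sequence.

    Hypothetical naming: exon subjects are plain accessions ("cdnaA"),
    intron subjects are "<accession>_intron<N>" ("cdnaA_intron3").
    """
    exon_matches = defaultdict(set)
    for est_id, accession in parse_hits(exon_lines):
        exon_matches[est_id].add(accession)

    confirmed = defaultdict(set)
    for est_id, intron_id in parse_hits(intron_lines):
        accession = intron_id.rsplit("_intron", 1)[0]
        if accession in exon_matches.get(est_id, ()):
            confirmed[est_id].add(intron_id)
    return dict(confirmed)


# Invented example data, for illustration only.
exon_hits = [
    "EST_A\tcdnaA\t99.5\t120",
    "EST_B\tcdnaA\t99.0\t80",
    "EST_C\tcdnaA\t97.0\t200",  # fails the identity threshold
]
intron_hits = [
    "EST_A\tcdnaA_intron3\t99.1\t45",
    "EST_A\tcdnaA_intron2\t98.5\t60",
    "EST_B\tcdnaA_intron1\t99.9\t25",  # too short
    "EST_C\tcdnaA_intron1\t99.0\t50",  # no stringent exon match
]

result = confirmed_intron_hits(exon_hits, intron_hits)
instances = sum(len(introns) for introns in result.values())
print(len(result), "ESTs,", instances, "intron hits")
```

Counting the dictionary keys gives the number of distinct ESTs, while summing the set sizes gives the number of intron hits, which is why the two totals can differ when one EST matches several introns.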

Hence 77 out of 6235 ESTs have confirmed intronic hits, which is 1.2% of this EST subset. We expect this figure to hold across all the ESTs, although the true figure cannot be known until the entire genome is sequenced.
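As a quick check of the arithmetic, the two estimates can be reproduced from the counts quoted above. The function name is purely illustrative.

```python
def contamination_percent(intron_positive, total):
    """Percentage of ESTs with at least one confirmed intron match."""
    return 100.0 * intron_positive / total


beta_actin = contamination_percent(2, 193)   # beta actin analysis
general = contamination_percent(77, 6235)    # EMBL-wide intron analysis
hits_per_est = 106 / 77                      # mean intron hits per positive EST

print(f"beta actin: {beta_actin:.1f}%")  # 1.0%
print(f"general:    {general:.1f}%")     # 1.2%
```

The two analyses use different reference sets and different thresholds, yet both give a figure close to 1%.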