Different data in this browser change at different rates. SNPs, for example, are assumed to be stable between dbSNP releases, while MGI-curated data can change daily. Different tracks are therefore updated at different times. This file records when those updates occurred.
Creating or updating these tracks can be thought of as a two-part process: 1) a set of sequence coordinates that is as comprehensive as possible needs to be collected or generated, and 2) appropriate annotation information needs to be added to the labels for these sequences on a track-by-track basis. We then use a drop-and-reload strategy for the entire GBrowse database.
As also noted in the known issues, we have dropped the 'Chr' prefix so that many of our tracks can be picked up at other resources using the DAS functionality in GBrowse. Work is in progress to get GBrowse to display 'Chr' even when that string is not explicitly listed in the data; searches or URLs that use the 'Chr' prefix will, however, not work and must be rephrased without 'Chr'.
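For anyone generating searches or URLs programmatically, the rephrasing can be sketched as below. This is only an illustration: the function name and the accepted chromosome names (1-2 digits, X, Y, M or MT) are assumptions, not part of GBrowse. The pattern strips 'Chr' only when a chromosome name follows, so gene symbols that happen to begin with "Chr" are left alone.

```python
import re

# Hypothetical helper: match a leading 'Chr' (any case) only when it is
# followed by a chromosome name and then ':' or the end of the string.
_CHR_PREFIX = re.compile(r"^chr(?=(?:\d{1,2}|X|Y|MT?)(?::|$))", re.IGNORECASE)

def strip_chr_prefix(landmark: str) -> str:
    """Return the landmark without a leading 'Chr' chromosome prefix."""
    return _CHR_PREFIX.sub("", landmark.strip())
```

With this sketch, "Chr7:1000..2000" becomes "7:1000..2000", while a symbol such as "Chrna1" is returned unchanged.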
Tracks currently available via DAS are the MGI Representative transcripts, all allele tracks, all phenotype tracks, and the QTL track. The names and details for these tracks can be seen at this URL: http://gbrowse.informatics.jax.org/cgi-bin/das/mouse_current/types
Please note that the way we serve DAS tracks may change with an updated version of GBrowse anticipated for mid-to-late September.
Coordinates for SNP and STS are written directly from the MGI database; this ensures any changes made to these data by MGI curation efforts will be reflected in the GBrowse track.
Coordinates for ENSEMBL and NCBI gene models were downloaded from the corresponding sites when they released their build 37 data. The MGI database reports MRK_ENSEMBL.rpt and MGI_Coordinate.rpt are then used to attach appropriate symbols to the gene models. VEGA genes are simply parsed from the VEGA data at this time, working from VEGA release 31.
QTL flanks are calculated from the known coordinates of the defining STS markers.
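One plausible reading of this calculation is sketched below; the exact rule used in production is not described here, so treat it as an assumption. The sketch takes the span from the lowest start to the highest end of the defining STS markers, and gives up if the markers are unplaced or fall on different chromosomes. The function name and data shapes are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[str, int, int]  # (chromosome, start, end)

def qtl_span(markers: List[str], sts_coords: Dict[str, Coord]) -> Optional[Coord]:
    """Return (chromosome, start, end) covering all placed defining markers."""
    placed = [sts_coords[m] for m in markers if m in sts_coords]
    if not placed:
        return None  # no defining marker has known coordinates
    chroms = {chrom for chrom, _, _ in placed}
    if len(chroms) != 1:
        return None  # markers disagree on chromosome; needs manual review
    start = min(s for _, s, _ in placed)
    end = max(e for _, _, e in placed)
    return (chroms.pop(), start, end)
```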
Coordinates for all mouse ESTs, mRNAs, and RefSeqs (NM_ sequences only) are downloaded from the UCSC FTP site and parsed to exclude all Chr_UN and Chr_Rnd coordinates. NR_, XR_, and XM_ RefSeqs are parsed from the MGI BLASTable database and BLATted against the assembly (a version that contains no Chr_UN and Chr_Rnd sequence). The complete sets of TIGR (now DFCI), NIA, and DoTS mouse sequences are also BLATted against the build 37 assembly. Hits from these BLATs are parsed using pslReps with the -singleHit option. (Note that this may lead to subtly different filtering than is seen in the coordinates from UCSC.)
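The exclusion step for the downloaded coordinates can be sketched as follows. This assumes the standard 21-column, tab-separated PSL layout with the target name (tName) in column 14, and it assumes unplaced and random sequences can be recognized by "Un", "Rnd", or "random" in the target name; the actual naming in a given download should be checked. It does not reproduce what pslReps does.

```python
import re
from typing import Iterable, Iterator

# Assumed naming: any target containing 'un', 'rnd' or 'random' is excluded.
_UNPLACED = re.compile(r"random|rnd|un", re.IGNORECASE)

def filter_psl(lines: Iterable[str]) -> Iterator[str]:
    """Yield PSL alignment lines whose target is a placed chromosome."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and anything that is not a full PSL record.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        if _UNPLACED.search(fields[13]):  # column 14 is tName
            continue
        yield line
```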
A script uses the MGI database report SEQ_RepTransGenomic.rpt to build a list of all sequences that should appear in the MGI Representative Transcripts track and the labels (e.g. seqID and symbol) that should be attached to each. The script then searches in the pool of coordinates for the corresponding seqID, writing the coordinate data to the GFF file when found and writing the seqID to an error log if it is missing.
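The lookup-and-log logic can be sketched as below. This is not the production script: the input shapes, the function name, and the GFF source, feature type, and group values ("MGI", "rep_transcript", "Sequence") are placeholders chosen for illustration, and parsing of SEQ_RepTransGenomic.rpt itself is left out.

```python
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[str, int, int, str]  # (chromosome, start, end, strand)

def build_gff(
    wanted: Iterable[Tuple[str, str]],  # (seqID, symbol) pairs from the report
    pool: Dict[str, Coord],             # seqID -> coordinates from the pool
) -> Tuple[List[str], List[str]]:
    """Return (GFF lines for seqIDs found, seqIDs missing from the pool)."""
    gff_lines: List[str] = []
    missing: List[str] = []
    for seq_id, symbol in wanted:
        coord = pool.get(seq_id)
        if coord is None:
            missing.append(seq_id)  # would be written to the error log
            continue
        chrom, start, end, strand = coord
        label = f"{seq_id}_{symbol}"
        gff_lines.append("\t".join([
            chrom, "MGI", "rep_transcript", str(start), str(end),
            ".", strand, ".", f'Sequence "{label}"',
        ]))
    return gff_lines, missing
```

Keeping the missing seqIDs in a separate list makes it easy to review which representative transcripts have no coordinates after each reload.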
A second script reformats the entire set of GenBank mRNA coordinates to a GFF file, adding symbol information when the MGI database report, MRK_Sequence.rpt, contains any additional information for the sequence.
A final script generates all the Allele and Phenotype tracks. Presenting these data is a little more complex as a seqID that is representative of a gene can be associated with multiple alleles and each allele can in turn be associated with multiple Mammalian Phenotype (MP) terms. The MGI database report MGI_PhenotypicAllele.rpt is parsed for this information: A non-redundant list of seqIDs is built; each seqID is then associated with the complete list of alleles that correspond to that seqID; finally a non-redundant list of all MP terms that can be associated with that seqID is generated regardless of which allele or alleles that MP term is actually associated with. The complete pool of coordinates is then searched as above.
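The association step can be illustrated with a small sketch. It assumes the report has already been parsed into (seqID, allele, MP terms) records; the column layout of MGI_PhenotypicAllele.rpt is not assumed, and the function name and record shape are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

def associate(
    records: Iterable[Tuple[str, str, List[str]]],
) -> Dict[str, Dict[str, List[str]]]:
    """Group alleles and MP terms under each seqID, without duplicates."""
    by_seq: Dict[str, Dict[str, List[str]]] = {}
    for seq_id, allele, mp_terms in records:
        entry = by_seq.setdefault(seq_id, {"alleles": [], "mp_terms": []})
        if allele not in entry["alleles"]:
            entry["alleles"].append(allele)
        # MP terms are pooled per seqID, regardless of which allele carries them.
        for term in mp_terms:
            if term not in entry["mp_terms"]:
                entry["mp_terms"].append(term)
    return by_seq
```

Each seqID in the result can then be looked up in the coordinate pool in the same way as for the representative transcripts.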
Coordinate data from MGI for micro RNAs were simply reformatted from one of our standard DB reports and displayed as the micro RNA track. Data from the VISTA enhancers site (experimentally verified elements only) were converted to mouse build 37 using the UCSC liftOver utility as described in the citation for that track.
Coordinates:
NIA version mm8 (08/03/2006), DoTS release 10, and TIGR release 16 (07/27/2006) were BLATted against build 37 on 6/28/2008.
RefSeqAli.psl, all_est.psl and all_mrna.psl were downloaded from UCSC on 10/17/2008.
RefSeq_NR, RefSeq_XR, and RefSeq_XM were parsed from the MGI BLASTable database on 10/17/2008 and BLATted.
VEGA Gene Model data are from release 31.
NCBI Gene Models are from the 10/26/2007 release.
ENSEMBL Gene Models are from release 50.
STS data are from a 05/04/2006 release (with subsequent MGI curation and correction included up to 5/27/2008).
SNP data are from dbSNP (~Mar 2008).
Associations:
MGI database reports MRK_Sequence.rpt, MGI_PhenotypicAllele.rpt, SEQ_RepTransGenomic.rpt, MGI_Coordinate.rpt, and MRK_ENSEMBL.rpt were downloaded 10/17/2008 and should reflect MGI curated data as of that date.
Other data:
QTL data were generated using the STS coordinates and list of QTLs in MGI on 5/29/2008. miRNA data reflect micro RNAs in MGI as of 6/2/2008, and enhancer data reflect experimentally validated enhancers listed on the VISTA site as of 6/2/2008.
Duplicate Hits
Even with careful filtering, some sequences have multiple matches in the mouse genome that are too close in quality to keep only the "best" hit. Previously, these multiply-hitting sequences were not treated in any special way. In this build, sequences that have multiple hits are flagged with an underscore and a number. A seqID with no trailing underscore-number, as in ID_123456, has just one unique position, while the first copy of a multiply-hitting sequence will appear as ID_567890_1. Subsequent copies of this sequence will have an incremented count (ID_567890_2, ID_567890_3, etc.). Often this seqID will be part of a more complex label, as in ID_567890_3_MGI:123456_Abc1. A search for ID_567890 should return multiple options for the different hit locations.
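The numbering convention can be sketched as follows, assuming the hits arrive as a simple list of seqIDs with one entry per hit location. The function name is hypothetical and this is not the production code.

```python
from collections import Counter
from typing import List

def label_hits(seq_ids: List[str]) -> List[str]:
    """Append _1, _2, ... to seqIDs that occur more than once."""
    totals = Counter(seq_ids)      # how many hits each seqID has overall
    seen: Counter = Counter()      # how many copies have been labeled so far
    labels = []
    for seq_id in seq_ids:
        if totals[seq_id] == 1:
            labels.append(seq_id)  # unique position: no suffix
        else:
            seen[seq_id] += 1
            labels.append(f"{seq_id}_{seen[seq_id]}")
    return labels

print(label_hits(["ID_123456", "ID_567890", "ID_567890", "ID_567890"]))
# ['ID_123456', 'ID_567890_1', 'ID_567890_2', 'ID_567890_3']
```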
Exclusion List
We have a list of genes that are known to be in some mouse strains but not in the reference strain. As many processes do not distinguish non-BL6 genes, these sequences are usually included in the data and will usually have an acceptable match on the BL6 assembly. The labels that are displayed for these sequences are flagged with a "NOT_IN_BL6" comment. Here is the current list of known non-BL6 markers:
If you know of additional examples, please let us know.