GeMSTONE about

GeMSTONE is a project maintained by the Haiyuan Yu Lab at Cornell University. For any questions or maintenance please contact Juan Felipe Beltran at

Getting Started …

To get started with GeMSTONE, please prepare your variant file in VCF (Variant Call Format). This is the only mandatory input file for GeMSTONE to run! If you have a control dataset, you can upload it via the second entry. If you want to perform familial (co-segregation) analysis, you can upload your pedigree file via the third entry. The inheritance model can be chosen later under the PEDIGREE tab.

Variant quality control filters are available at both the site and genotype levels. In particular, the variant allele frequency (AF) filter can be applied at the per-sample level: the ethnicity-specific AF database for each sample can be specified in the pedigree file using the “database_subpopulation” code (see specifics on .ped format customization). If it is not specified, the overall AF in the ExAC database will be used.
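The per-sample AF lookup described above can be sketched as follows. This is only an illustration under stated assumptions, not GeMSTONE's implementation: the keys "ExAC_overall" and "ExAC_EAS" are hypothetical stand-ins for the real “database_subpopulation” codes, and treating a variant that is missing from a database as having AF 0 is also an assumption.

```python
# Hypothetical key for the overall ExAC allele frequency (not a real GeMSTONE code).
DEFAULT_DATABASE = "ExAC_overall"

def sample_af(variant_afs, database_subpopulation=None):
    """Return the AF to use for one sample.

    variant_afs maps a database/subpopulation code to the variant's AF.
    When no code is given for the sample, fall back to the overall AF.
    Assumption: a variant absent from the chosen database has AF 0.0.
    """
    key = database_subpopulation or DEFAULT_DATABASE
    return variant_afs.get(key, 0.0)

def passes_af_filter(variant_afs, max_af, database_subpopulation=None):
    """Keep the variant for this sample if its AF does not exceed max_af."""
    return sample_af(variant_afs, database_subpopulation) <= max_af

# Illustrative values only.
afs = {"ExAC_overall": 0.002, "ExAC_EAS": 0.03}
print(passes_af_filter(afs, 0.01))              # sample without a code: overall AF is used
print(passes_af_filter(afs, 0.01, "ExAC_EAS"))  # sample with a subpopulation code
```

In this example the same variant is kept for a sample with no code and removed for a sample assigned to the subpopulation in which it is common.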

One of six inheritance models can be selected to perform co-segregation analysis for familial cases. If all your samples are sporadic, “No Inheritance” can be chosen to parse the variant file. Two recurrence filters can be applied. The basic idea is to use a lower bound to select variants that affect at least a certain number of samples, and an upper bound to exclude variants that are likely to be sequencing artifacts. “Multiple Occurrence Across All Samples” treats all affected samples as independent and requires the variant to be shared by a number of samples within the specified range. “Multiple Occurrence Across Families” treats samples in the same family as one unit, so each family is counted once. The latter filter uses the results of the co-segregation analysis, i.e., in each family only co-segregating variants are counted. If you are specifically interested in the recurrence of co-segregating variants, checking “Variants Must Occur In At Least 1 Non-Sporadic Family” will look for variants that co-segregate in at least one family and also occur multiple times in other samples or families. Variants in each sample (or co-segregating variants in each family) are written to separate result files, and recurrent variants are written to another result file.
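To make the difference between the two recurrence filters concrete, here is a minimal sketch. The function names, the variant keys and the inclusive treatment of both bounds are assumptions made for illustration; they are not taken from GeMSTONE itself.

```python
def recurrence_across_samples(carriers_by_variant, lower, upper):
    """Keep variants carried by between lower and upper samples (inclusive).

    carriers_by_variant maps a variant key to the affected samples carrying it;
    every sample is counted independently.
    """
    return {
        variant
        for variant, samples in carriers_by_variant.items()
        if lower <= len(set(samples)) <= upper
    }

def recurrence_across_families(variants_by_unit, lower, upper):
    """Keep variants seen in between lower and upper units (inclusive).

    variants_by_unit maps a family ID (or a sporadic sample ID) to a set of
    variants. For a family, the set is assumed to contain only variants that
    passed co-segregation analysis, so each family counts at most once.
    """
    counts = {}
    for variants in variants_by_unit.values():
        for variant in variants:
            counts[variant] = counts.get(variant, 0) + 1
    return {v for v, n in counts.items() if lower <= n <= upper}

# Illustrative data: the third variant is so common it looks like an artifact.
carriers = {
    "chr1:100:A>G": ["S1", "S2", "S3"],
    "chr2:200:C>T": ["S1"],
    "chr3:300:G>A": ["S1", "S2", "S3", "S4", "S5"],
}
print(recurrence_across_samples(carriers, lower=2, upper=4))

units = {
    "FAM1": {"chr1:100:A>G"},
    "FAM2": {"chr1:100:A>G", "chr2:200:C>T"},
    "S5": {"chr2:200:C>T"},  # sporadic sample treated as its own unit
}
print(sorted(recurrence_across_families(units, lower=2, upper=3)))
```

With a range of 2 to 4, the per-sample filter drops both the singleton and the variant present in all five samples, while the per-family filter counts FAM1 and FAM2 once each regardless of how many members carry the variant.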

Since allele frequency plays a significant role in disease-associated variant prioritization and can be very different among sub-populations, GeMSTONE provides annotations from four public AF databases in addition to AF filtering. Please note that the selections here are used only for annotation, not for filtering; AF filters are applied as specified under the “VCF FILTERS” tab.

“Variant Consequence” and “Transcript Biotype” of interest can be selected here.

23 different algorithms for predicting variant function are available here! For each predictor, a range can be specified. A global deleteriousness filter is also available, which sets a threshold on the number of selected deleteriousness predictors needed for a variant to pass the filter. The output of each selected predictor, as well as the number of predictors counted by the “Deleteriousness Filter” for each variant, will be annotated in the result table.
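The global deleteriousness filter can be pictured as a simple vote count. The sketch below assumes that a predictor votes "deleterious" when its score falls inside the user-specified range (bounds inclusive) and that a missing score does not count as a vote. The predictor names and ranges are illustrative placeholders, not GeMSTONE defaults.

```python
def count_deleterious(scores, ranges):
    """Count selected predictors whose score lies inside its specified range.

    scores maps predictor name -> score for one variant (may lack entries).
    ranges maps predictor name -> (low, high) as chosen by the user.
    """
    count = 0
    for name, (low, high) in ranges.items():
        score = scores.get(name)
        # Assumption: a predictor with no score does not vote.
        if score is not None and low <= score <= high:
            count += 1
    return count

def passes_deleteriousness_filter(scores, ranges, min_predictors):
    """Pass the variant if enough selected predictors vote deleterious."""
    return count_deleterious(scores, ranges) >= min_predictors

# Illustrative ranges and scores only.
ranges = {"SIFT": (0.0, 0.05), "PolyPhen2": (0.85, 1.0), "CADD": (20.0, 100.0)}
scores = {"SIFT": 0.01, "PolyPhen2": 0.40, "CADD": 25.3}

print(count_deleterious(scores, ranges))
print(passes_deleteriousness_filter(scores, ranges, min_predictors=2))
```

Here two of the three selected predictors fall inside their ranges, so the variant passes a threshold of 2 but would fail a threshold of 3.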

Gene-level annotations are available from Gene Ontology and disease-associated databases, and a user-defined gene list of interest can be uploaded. Protein-level annotations cover protein domains and protein-protein interactions. Most options in GENE ANNOTATION (1) and (2) can serve a double purpose, either as filters or simply as annotations, by (un)checking the “Filter Out Genes…” box that follows the selection of annotations. GeMSTONE also provides the option to combine information across libraries. For instance, checking the “Add GO Annotation For Interaction Partners” box will annotate the gene’s interaction partners that have the GO terms selected from the dropdown list.

Extended annotations from pathway databases, and pathway enrichment analysis on candidate genes using Fisher’s exact test.
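For readers unfamiliar with the test, the sketch below computes a one-sided Fisher’s exact test p-value for pathway enrichment using only the Python standard library. Whether GeMSTONE uses a one-sided or two-sided test, and which gene set it uses as background, is not stated here, so both choices in this example are assumptions.

```python
from math import comb

def enrichment_p_value(hits, candidates, pathway_size, background):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    hits         -- candidate genes that belong to the pathway
    candidates   -- total number of candidate genes
    pathway_size -- number of background genes in the pathway
    background   -- total number of genes in the background set

    Returns P(X >= hits) when `candidates` genes are drawn at random
    from the background without replacement.
    """
    total = comb(background, candidates)
    max_hits = min(candidates, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, candidates - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total

# Illustrative numbers: 8 of 50 candidate genes fall in a 200-gene pathway,
# against an assumed background of 20000 genes.
p = enrichment_p_value(hits=8, candidates=50, pathway_size=200, background=20000)
print(f"p = {p:.3g}")
```

A small p-value indicates that the candidate list contains more pathway members than expected by chance; when many pathways are tested, the p-values would normally need multiple-testing correction.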

Protein expression profile annotations.

GDI and RVIS annotations. Gene burden tests can be performed (using the PLINKSEQ tool) if a control VCF file is provided. Please note that the gene burden test can be the rate-limiting step of the workflow and could significantly extend the analysis time.

Running a sample test …
1. Download sample inputs available under each entry;
2. Enter a couple of thresholds or annotations, or leave everything at its default;
3. Submit!

Replicating a sample test …
1. Upload the recipe file together with the inputs from the previous sample test;
2. Submit!

Interpreting the result tables …
Two types of result tables are generated by GeMSTONE:
1. Variant tables start with the chromosome number and chromosomal position, followed by variant-level annotations (e.g. allele frequencies, variant function prediction scores, etc.). Importantly, the sample IDs of carriers are in column 7. A separate variant table is generated for each sample or family, named with the corresponding sample ID, and a combined variant table is generated with the name “Cross-sample_all_variants.txt”. In the combined table, column 7 contains the IDs of all samples that carry the variant, grouped by square brackets, i.e., multiple samples in the same pair of square brackets are from the same family.
2. Gene tables start with the gene name, followed by gene-level annotations (e.g. GO terms, interaction partners, etc.).
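If you post-process the combined variant table yourself, the bracketed carrier column can be split into families along the following lines. The exact layout of column 7 is an assumption here: the sketch supposes a string such as "[S1,S2][S3]" with comma-separated sample IDs inside each pair of brackets, so check it against a real output file before relying on it.

```python
import re

# Matches the text inside each pair of square brackets.
BRACKET_GROUP = re.compile(r"\[([^\]]*)\]")

def parse_carriers(field):
    """Split a bracketed carrier field into a list of families.

    Each inner list holds the sample IDs found inside one pair of brackets.
    Assumes IDs within a bracket are comma-separated (hypothetical format).
    """
    families = []
    for group in BRACKET_GROUP.findall(field):
        samples = [s.strip() for s in group.split(",") if s.strip()]
        families.append(samples)
    return families

def carriers_from_row(row, column=7):
    """Parse the carrier field from one tab-separated table row (1-based column)."""
    fields = row.rstrip("\n").split("\t")
    return parse_carriers(fields[column - 1])

# Illustrative field: two carriers from one family, plus one unrelated sample.
print(parse_carriers("[S1,S2][S3]"))
```

Counting the inner lists gives the number of families (or sporadic samples) carrying the variant, while flattening them gives the number of individual carriers.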

You can download sample files from the menu below or directly from the GeMSTONE site to test the system without any data of your own. Recipe files can be uploaded onto the site to pre-set parameters from a previous run.