LD-based SNP selection

Introduction

A few people asked me the best way to identify SNPs in a particular array (for example, Illumina array with 550K markers), that are best proxy for a list of candidate SNPs (for example, dozens of SNPs reported in a meta-analysis paper using imputation data). The find_ld_snp.pl program is designed to address this issue. It finds the best proxy markers for a given list of candidate markers based on linkage disequilibrium from a GWAS data set (in PLINK format).

Usage

The program requires that PLINK be installed in the system first, since it calls PLINK for LD calculation. Typing the command without arguments will print out a simple help message.

[kaiwang@biocluster ~/]$ find_ld_snp.pl 
Usage:
     find_ld_snp.pl [arguments] <query-SNP-list-file> <candidate-SNP-list-file> <PLINK-binary-prefix>

     Optional arguments:
            -v, --verbose                   use verbose output
            -h, --help                      print help message
            -m, --man                       print complete documentation

     Function: find SNPs in candidate list that are best proxy for SNPs in query list

     Example: find_ld_snp.pl querylist humanhap550.snplist hapmap_ceu_r23a

It takes three input files: a query SNP list file which has one SNP per line, a larger candidate SNP list file file containing all candidate SNPs to be selected, and a PLINK file prefix. For example, if you have hapmap_CEU_r23a.bim, hapmap_CEU_r23a.fam, hapmap_CEU_r23a.bed file, you just need to specify hapmap_CEU_r23a.

Now issue the command:

[kaiwang@cc ~/]$ find_ld_snp.pl speliotes.snplist snplist.illumina ~/lib/hapmap/plink1/hapmap_CEU_r23a

it will tell what are the best SNPs to use in the Illumina array, their r2 measures, and their distance to each other. Use the "-m" argument to read the manual for more information.