PSSD Genome Project [Megathread]

This is a place to post research you have done on the topic along with your conclusions.
sovietxrobot
Posts: 286
Joined: Tue Oct 03, 2017 12:04 pm

Re: PSSD Genome Project [Megathread]

Unread post by sovietxrobot »

it looks like you are looking at single SNP-level data rather than gene-level. If you are familiar with R, you can get the minor allele frequencies for all of the SNPs in the 1000 Genomes project with the package "MafDb.1Kgenomes.phase3.hs37d5".
Ghost
Posts: 1750
Joined: Wed Jan 21, 2015 11:16 pm
Location: USA

Re: PSSD Genome Project [Megathread]

Unread post by Ghost »

Good advice. I think they have good resources.

R scares me but I have taken a class in it before.

The 1000 Genomes data looks to be the best option; I just need to figure out how to clean the data to be what I want.
- Medical Student & Friendly poltergeist - Lexapro Sept '14. [Hx] [PSSD Lab] [r/PSSD] [Treatment Plan] - Add "Ghost" in replies so I see it :)
sovietxrobot
Posts: 286
Joined: Tue Oct 03, 2017 12:04 pm

Re: PSSD Genome Project [Megathread]

Unread post by sovietxrobot »

I can send you a script to fetch that data and write it to text so you can import it into Python. When you say "clean the data to be what you want" - can you be more specific?
Ghost
Posts: 1750
Joined: Wed Jan 21, 2015 11:16 pm
Location: USA

Re: PSSD Genome Project [Megathread]

Unread post by Ghost »

That would be great. So far I've been messing with scikit-allel.

Code:

# import scikit-allel
import allel
# check which version is installed
print(allel.__version__)

import os
os.chdir(r"C:\Users\Ghost\Desktop\Genome")

# read the 1000 Genomes chrY VCF into a pandas DataFrame
df = allel.vcf_to_dataframe('ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf',
                            fields=['ID', 'ALT_1', 'ALT_2', 'AF_1', 'AF_2'],
                            alt_number=2)

# dump the DataFrame's string representation to a text file
with open('results1000.txt', 'w') as results:
    results.write(str(df))

print(df)
It works, in that it prints out the following:

Code:

               ID  ALT_1  ALT_2  AF_1  AF_2
0      rs11575897    NaN    NaN   NaN   NaN
1               .    NaN    NaN   NaN   NaN
2               .    NaN    NaN   NaN   NaN
3               .    NaN    NaN   NaN   NaN
4               .    NaN    NaN   NaN   NaN
...           ...    ...    ...   ...   ...
62037           .    NaN    NaN   NaN   NaN
62038           .    NaN    NaN   NaN   NaN
62039           .    NaN    NaN   NaN   NaN
62040           .    NaN    NaN   NaN   NaN
62041           .    NaN    NaN   NaN   NaN

[62042 rows x 5 columns]
However, I can't get all of it into a text file - the output keeps getting cut short.

Any ideas you have would be appreciated.

By "Cleaning" I just mean getting it into a format that we can use to compare it against what we have already for the PSSD user data. This is hard because I really don't have much idea of what I'm looking for except "increase control sample size and turn it into a text file". The data from 23andme looks like this:

Code:

# rsid	chromosome	position	genotype
rs4477212	1	72017	AA
rs3094315	1	742429	AG
rs3131972	1	742584	AG
rs12124819	1	766409	AA
rs11240777	1	788822	GG
rs6681049	1	789870	CC
rs4970383	1	828418	AA
rs4475691	1	836671	TT
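
For one of those files, I picture the first conversion step looking something like this (rough, untested sketch - the file names and output layout are placeholders, and the real version would loop over every PSSD member's file and add the counts together):

Code:

# tally allele counts per SNP from one 23andMe-style export
from collections import Counter, defaultdict

counts = defaultdict(Counter)   # rsid -> counts of A/C/G/T calls

with open('genome_member1.txt') as f:          # placeholder file name
    for line in f:
        if line.startswith('#') or not line.strip():
            continue                            # skip the header and blank lines
        rsid, chrom, pos, genotype = line.split()
        for allele in genotype:                 # "AG" counts one A and one G
            if allele in 'ACGT':                # ignore no-calls like "--"
                counts[rsid][allele] += 1

# write the shared format: rs#, A/C/G/T frequencies, number of alleles observed
with open('pssd_freqs.txt', 'w') as out:
    for rsid, c in counts.items():
        n = sum(c.values())
        freqs = ['%.3f' % (c[a] / n) for a in 'ACGT']
        out.write('\t'.join([rsid] + freqs + [str(n)]) + '\n')
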
@Cipro really helped with the original version of the program, but basically we need to determine how common an allele is in the PSSD group and compare that to how common it is in this huge control group.

"Cleaning" would also mean reducing the number of SNPs we're looking at. For example, most SNPs are useless because they have no research on them, or they don't overlap between the databases. There is a file called "snp.txt" - a list of all the SNPs I'm interested in or that SNPedia has pages on. Last year I went through 500+ SNPedia pages and copied/pasted all that data into a .txt file. We may not really need much "cleaning" here, because the program checks whether a SNP is in that file before doing the frequency calculations, but I still think we waste time by making it look through a list of mostly useless SNPs each run. The ideal program would take the document created from the 1000 Genomes data and cut it down to only the SNPs we care about (something like the sketch below). Ditto for the PSSD genomes.
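
Something like this is what I have in mind for that trimming step (untested sketch; it assumes snp.txt has one rsID per line, and the file names are placeholders):

Code:

# keep only the lines whose rsID appears in snp.txt
with open('snp.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}

with open('control_freqs.txt') as src, open('control_freqs_trimmed.txt', 'w') as out:
    for line in src:
        if line.split('\t', 1)[0] in wanted:
            out.write(line)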

So, overview:
- Take all the data from either format (23andMe or VCF/1000 Genomes) and turn it into one common .txt format with the following information: rs#, allele frequencies for A/G/C/T, and sample size. (The sample-size cutoff is set when the program is launched - we don't want data from SNPs we only have one observation for, since that gives a lot of "100% A"-type results and leads to false positives.)
- Then we have two text files, PSSD and Control. Each line has six data points (rs#, the four A/G/C/T frequencies, and sample size). Both files contain more SNPs than we care about, because not all of them are useful, so we run each one against the 'snp.txt' list and cut out the SNPs that aren't on it. Now each list only holds SNPs that are known, or at least in the SNPedia database.
- Cut out the SNPs with small sample sizes. If only one PSSD patient has a result for rs1234, we don't care about it - the cutoff should be something like 20, but we can set that number at the start like we did in the original program.
- Now we finally have two lists of SNPs that we care about, trimmed to those with sizeable sample sizes. NOW we can run the comparison between them, printing the results as they come (sketched below).
---
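
And the final comparison could stay pretty simple (untested sketch - MIN_N and the file names are placeholders, and the "difference" printed here is just the biggest allele-frequency gap per SNP; the real program can get fancier with the stats):

Code:

# each line in the trimmed files: rsid, freq_A, freq_C, freq_G, freq_T, sample_size
MIN_N = 20   # sample-size cutoff, set at launch

def load_freqs(path):
    table = {}
    with open(path) as f:
        for line in f:
            rsid, a, c, g, t, n = line.rstrip('\n').split('\t')
            if int(n) >= MIN_N:
                table[rsid] = [float(a), float(c), float(g), float(t)]
    return table

pssd = load_freqs('pssd_freqs_trimmed.txt')
control = load_freqs('control_freqs_trimmed.txt')

# compare only SNPs present in both groups, printing results as they come
for rsid in sorted(set(pssd) & set(control)):
    diff = max(abs(p - c) for p, c in zip(pssd[rsid], control[rsid]))
    print(rsid, round(diff, 3))
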
I have some lost scripts on my old computer that turn the results into visuals. I will fish those out, and eventually I will add them to the end so that for each SNP we find, a report with graphs gets generated. At this time I don't think I need help on this part.

Phew...Sorry that was a lot.
- Medical Student & Friendly poltergeist - Lexapro Sept '14. [Hx] [PSSD Lab] [r/PSSD] [Treatment Plan] - Add "Ghost" in replies so I see it :)
sovietxrobot
Posts: 286
Joined: Tue Oct 03, 2017 12:04 pm

Re: PSSD Genome Project [Megathread]

Unread post by sovietxrobot »

You can convert 23andMe data to VCF format using bcftools: https://www.biostars.org/p/374149/

vcftools will allow you to extract a list of SNPs very efficiently.
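
For example, something along these lines to pull out just the rsIDs listed in snp.txt (untested sketch, wrapped in Python only to keep everything in one language - the VCF file name is a placeholder, and snp.txt is assumed to be one rsID per line):

Code:

import subprocess

# keep only the variants whose IDs are listed in snp.txt;
# --recode writes the filtered VCF to chr1_subset.recode.vcf
subprocess.run(['vcftools', '--gzvcf', 'ALL.chr1.phase3.genotypes.vcf.gz',
                '--snps', 'snp.txt',
                '--recode', '--out', 'chr1_subset'],
               check=True)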

Ultimately, you want a matrix with subjects as rows, SNPs as columns, and cells as genotypes. Genotypes are coded as 0 (homozygous for the major allele), 1 (heterozygous), or 2 (homozygous for the minor allele).
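
For example, a toy version of that coding (just a sketch - the rsIDs are taken from your 23andMe snippet, and the minor-allele assignments are made up):

Code:

# sketch: code each genotype as 0/1/2 copies of the minor allele
def additive_code(genotype, minor_allele):
    # 'AG' with minor allele 'G' -> 1; 'GG' -> 2; 'AA' -> 0
    return sum(1 for allele in genotype if allele == minor_allele)

# toy example: two subjects x two SNPs, minor alleles assumed known
minor = {'rs3094315': 'G', 'rs3131972': 'A'}
subjects = {
    'subj1': {'rs3094315': 'AG', 'rs3131972': 'GG'},
    'subj2': {'rs3094315': 'GG', 'rs3131972': 'AG'},
}

matrix = {s: {snp: additive_code(gt, minor[snp]) for snp, gt in genos.items()}
          for s, genos in subjects.items()}
print(matrix)   # {'subj1': {'rs3094315': 1, 'rs3131972': 0}, 'subj2': {...}}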

Data QC depends on how many subjects you have. You are correct that you should exclude subjects or SNPs with too few observations.

I'm not sure why the text gets cut short - maybe the file is too big? You could split it by chromosome.
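
If it isn't file size, it might just be pandas abbreviating its printed output - str(df) only gives the truncated summary you pasted. Writing with to_csv (or to_string) should dump every row. Rough, untested sketch, with a toy frame standing in for yours:

Code:

import pandas as pd

# toy stand-in for the big scikit-allel DataFrame
df = pd.DataFrame({'ID': ['rs11575897', '.', '.'],
                   'AF_1': [None, None, None]})

# str(df)/print(df) abbreviate large frames; to_csv writes every row
df.to_csv('results1000.txt', sep='\t', index=False)

# to_string() keeps the aligned layout and also renders all rows
with open('results1000_pretty.txt', 'w') as out:
    out.write(df.to_string())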

Do you have actual control subjects, or are you comparing against the population MAF as estimated from 1000 Genomes? That determines what kind of statistical testing you can do.
Dubya_B
Posts: 49
Joined: Sat Apr 30, 2016 3:33 pm

Re: PSSD Genome Project [Megathread]

Unread post by Dubya_B »

For example - Most SNPs are useless because they have no research,
I fear you may pass up a "PSSD SNP" if you go about your project this way. Even if there is no research on a particular SNP, it would still be associated with other, possibly related genes through its location. Not sure if this is the correct term, but there may be linkage disequilibrium between that SNP and other SNPs that are often inherited with it as a set.

To save time and frustration, the PLINK documentation recommended pre-processing to remove SNPs with only common variants. ...Basically, many SNPs don't say much more than "I am a human," and those will not be of much value when researching a rare disease.

ps- Good to hear this is still going forward.
Ghost
Posts: 1750
Joined: Wed Jan 21, 2015 11:16 pm
Location: USA

Re: PSSD Genome Project [Megathread]

Unread post by Ghost »

sovietxrobot wrote: You can convert 23andMe data to VCF format using bcftools: https://www.biostars.org/p/374149/

vcftools will allow you to extract a list of SNPs very efficiently.

Ultimately, you want a matrix with subjects as rows, SNPs as columns, and cells as genotypes. Genotypes are coded as 0 (homozygous for the major allele), 1 (heterozygous), or 2 (homozygous for the minor allele).

Data QC depends on how many subjects you have. You are correct that you should exclude subjects or SNPs with too few observations.

I'm not sure why the text gets cut short - maybe the file is too big? You could split it by chromosome.

Do you have actual control subjects, or are you comparing against the population MAF as estimated from 1000 Genomes? That determines what kind of statistical testing you can do.
I think it has to do with the different data structures. The way I understand it, NumPy and pandas set up their arrays differently, and scikit-allel uses NumPy, I think? There is a setting somewhere for the truncation level, but I don't know how to set it.

I'm comparing against the population from 1000 Genomes.
- Medical Student & Friendly poltergeist - Lexapro Sept '14. [Hx] [PSSD Lab] [r/PSSD] [Treatment Plan] - Add "Ghost" in replies so I see it :)
Ghost
Posts: 1750
Joined: Wed Jan 21, 2015 11:16 pm
Location: USA

Re: PSSD Genome Project [Megathread]

Unread post by Ghost »

Dubya_B wrote:
For example - Most SNPs are useless because they have no research,
I fear you may pass up a "PSSD SNP" if you go about your project this way. Even if there is no research on a particular SNP, it would still be associated with other, possibly related genes through its location. Not sure if this is the correct term, but there may be linkage disequilibrium between that SNP and other SNPs that are often inherited with it as a set.

To save time and frustration, the PLINK documentation recommended pre-processing to remove SNPs with only common variants. ...Basically, many SNPs don't say much more than "I am a human," and those will not be of much value when researching a rare disease.

ps- Good to hear this is still going forward.
Yeah, that's a worry I do have. I just know from past experience that I've struggled with finding SNPs and then having no idea what to do with them, because nothing is known about them.

It might be worth using PLINK or other GWAS tools to do this work. They definitely let me do more things; it just gets confusing at some point, as my genetics knowledge is limited compared to the PhDs doing most of the work in the field.

Are you aware if a list of those SNPs exists for download? Especially one that I could use to amend my list of those I'm interested in?
- Medical Student & Friendly poltergeist - Lexapro Sept '14. [Hx] [PSSD Lab] [r/PSSD] [Treatment Plan] - Add "Ghost" in replies so I see it :)
sovietxrobot
Posts: 286
Joined: Tue Oct 03, 2017 12:04 pm

Re: PSSD Genome Project [Megathread]

Unread post by sovietxrobot »

I am not sure what is meant by removing common variants - common variants can contribute to disease risk. This isn't a step I have ever taken, but it could be that the genotyping platform already ignores these variants.

The typical process in genomics research is to get a sample set genotyped on some platform (usually ~300k SNPs). These data are reduced to only well-sampled variants and then imputed against a reference panel. Essentially, you use the reference data and the correlation between nearby SNPs (linkage disequilibrium) to interpolate SNPs you haven't directly observed. After QC of the imputed data, you are left with something in the ballpark of millions of SNPs.

Therein lies the problem - the sample size for this project is going to be tiny. A genome-wide approach looking to discover new risk SNPs is not realistic. Analyzing a smaller set of SNPs that you already have a hypothesis about is much more tractable.
Ghost
Posts: 1750
Joined: Wed Jan 21, 2015 11:16 pm
Location: USA

Re: PSSD Genome Project [Megathread]

Unread post by Ghost »

sovietxrobot wrote: I am not sure what is meant by removing common variants - common variants can contribute to disease risk. This isn't a step I have ever taken, but it could be that the genotyping platform already ignores these variants.

The typical process in genomics research is to get a sample set genotyped on some platform (usually ~300k SNPs). These data are reduced to only well-sampled variants and then imputed against a reference panel. Essentially, you use the reference data and the correlation between nearby SNPs (linkage disequilibrium) to interpolate SNPs you haven't directly observed. After QC of the imputed data, you are left with something in the ballpark of millions of SNPs.

Therein lies the problem - the sample size for this project is going to be tiny. A genome-wide approach looking to discover new risk SNPs is not realistic. Analyzing a smaller set of SNPs that you already have a hypothesis about is much more tractable.
Good points.
- Medical Student & Friendly poltergeist - Lexapro Sept '14. [Hx] [PSSD Lab] [r/PSSD] [Treatment Plan] - Add "Ghost" in replies so I see it :)