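Rosalind Problem: Open Reading Frames (Problem ID: ORF)
This problem mainly reuses a few functions from earlier problems: reverse complement, transcription, and translation. They are all fairly simple. I never wrote separate blog posts for those earlier easy problems, and since this one needs several of the programs I had already written, I took the chance to collect them as functions in the code below.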
Problem
Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.
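To make the "six reading frames" idea concrete, here is a minimal sketch that splits a DNA string into its six frames. The helper names `reverse_complement` and `six_frames` are my own choices for illustration and are not part of the Rosalind statement; the sketch assumes the input contains only A, C, G, and T.

```python
def reverse_complement(dna):
    """Reverse complement of a DNA string (A<->T, C<->G)."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))


def six_frames(dna):
    """Return the six reading frames, each as a list of complete codons."""
    frames = []
    # Frames 0-2 come from the string itself, frames 3-5 from its reverse complement.
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            # Stop at len(strand) - 2 so that only full 3-base codons are kept.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


for frame in six_frames("ATGCATTAG"):
    print(frame)
```

An ORF is then a run of codons inside one of these frames that begins with the start codon and continues up to the first stop codon.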
Given: A DNA string s of length at most 1 kbp in FASTA format.
Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.
Sample Dataset
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
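The dataset arrives in FASTA format, so the label line has to be stripped before the sequence can be used. A minimal sketch of that step follows; `read_fasta` is a hypothetical helper name, and it assumes the usual layout of a `>` label line followed by one or more sequence lines.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each record label to its sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # Sequence lines of one record may be wrapped, so join them.
            records[label] += line
    return records


sample = """>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
"""
dna = read_fasta(sample)["Rosalind_99"]
print(dna)
```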
Sample Output
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE
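As a cross-check of the sample, here is a compact sketch that works directly on the DNA alphabet and collects the proteins in a set, so duplicates drop out automatically. It is an alternative formulation, not the code from this post: the names `CODON_TABLE` and `candidate_proteins` are my own, the table is the standard genetic code with `*` marking stop codons, and the input is assumed to contain only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))


def candidate_proteins(dna):
    """Collect every distinct protein translated from an ORF on either strand."""
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            protein = []
            for i in range(start, len(strand) - 2, 3):
                amino = CODON_TABLE[strand[i:i + 3]]
                if amino == "*":
                    # Only keep the protein if a stop codon actually closes the frame.
                    proteins.add("".join(protein))
                    break
                protein.append(amino)
    return proteins


sample = ("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGG"
          "AATAAGCCTGAATGATCCGAGTAGCATCTCAG")
for protein in sorted(candidate_proteins(sample)):
    print(protein)
```

A start codon that runs off the end of the strand without meeting a stop codon contributes nothing, which matches the definition of an ORF in the problem statement.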
My code
The code this time is simple and easy to follow, so there is not much to explain: each function does one straightforward job.
'''
Rosalind Problems: [ORF] Open Reading Frames
'''


def rev_comp(fa):
    '''
    Return two RNA strings for the DNA string fa:
    the reverse complement of fa written in RNA letters,
    and the reverse complement of that string (fa itself with T replaced by U).
    '''
    # Walk fa from its last base to its first and complement each base.
    fa_transcript = ''
    for i in range(len(fa)):
        index = -(i+1)
        if fa[index] == 'A':
            fa_transcript = fa_transcript + 'U'
        elif fa[index] == 'T':
            fa_transcript = fa_transcript + 'A'
        elif fa[index] == 'C':
            fa_transcript = fa_transcript + 'G'
        else:
            fa_transcript = fa_transcript + 'C'
    # Reverse-complement the RNA string again to get the other strand.
    fa_rc = ''
    for i in range(len(fa_transcript)):
        index = -(i+1)
        if fa_transcript[index] == 'A':
            fa_rc = fa_rc + 'U'
        elif fa_transcript[index] == 'U':
            fa_rc = fa_rc + 'A'
        elif fa_transcript[index] == 'C':
            fa_rc = fa_rc + 'G'
        else:
            fa_rc = fa_rc + 'C'
    return fa_transcript, fa_rc


def find_all(s, substring):
    '''
    Find all occurrences of substring in s.
    Returns the list of start indexes, or -1 if there is no match.
    Only referenced by the commented-out lines in orf().
    '''
    index_list = []
    index = s.find(substring)
    while index != -1:  # find() returns -1 if there is no match.
        index_list.append(index)
        index = s.find(substring, index+1)
    # mimic the return rule of find()
    if len(index_list) > 0:
        return index_list
    else:
        return -1


def orf(mrna):
    '''
    Return every open reading frame in mrna as an RNA string
    that starts with AUG and stops just before the first stop codon.
    '''
    #finding = find_all(mrna, 'AUG')
    #print(finding)
    start_codon = 'AUG'
    stop_codon = ['UAA', 'UAG', 'UGA']
    i, j = 0, 0  # i is the read position, j remembers where the current scan started
    out = []
    while i <= len(mrna) - 3:
        if mrna[i:i+3] == start_codon:
            j = i
            sequence = ''
            # Read codon by codon until a stop codon or the end of the string.
            while i <= len(mrna) - 3:
                if mrna[i:i+3] in stop_codon:
                    out.append(sequence)
                    break
                sequence = sequence + mrna[i:i+3]
                i = i + 3
        # Resume scanning one base after the last start position.
        i = j + 1
        j = j + 1
    #print(out)
    return out


def translate(rnaseq):
    '''
    Translate an RNA string into a protein string, stopping at the first stop codon.
    '''
    codon_table = {
        'UUU': 'F', 'CUU': 'L', 'AUU': 'I', 'GUU': 'V',
        'UUC': 'F', 'CUC': 'L', 'AUC': 'I', 'GUC': 'V',
        'UUA': 'L', 'CUA': 'L', 'AUA': 'I', 'GUA': 'V',
        'UUG': 'L', 'CUG': 'L', 'AUG': 'M', 'GUG': 'V',
        'UCU': 'S', 'CCU': 'P', 'ACU': 'T', 'GCU': 'A',
        'UCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
        'UCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A',
        'UCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
        'UAU': 'Y', 'CAU': 'H', 'AAU': 'N', 'GAU': 'D',
        'UAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
        'UAA': 'Stop', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E',
        'UAG': 'Stop', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
        'UGU': 'C', 'CGU': 'R', 'AGU': 'S', 'GGU': 'G',
        'UGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
        'UGA': 'Stop', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G',
        'UGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'}
    length = len(rnaseq)
    proseq = []
    for i in range(0, length, 3):
        triplet = rnaseq[i:i+3]
        if codon_table[triplet] != 'Stop':
            proseq.append(codon_table[triplet])
        else:
            break
    proseq = ''.join(proseq)
    return proseq


dna = 'TATACATCACTCCAGGCATCAGAAAATCATGAGAAAGTCTGTGCGCGTAGCGAGAAGGTAGGCTCATTTGTTACCCTTGGACAACTACTGCCGCGTCTGGGCCTCCAAATCGGCTGGTCTTTTTCAGCTCCGTCTTAGGTATCGCGAAATGGACGGGAGGACCATAACTTACCTCCTCTTCTTTTGGCAGTCAGGCTATGACCACGTTTTGTCGGTTACAGATCACCTACCGCGGCGTAACACTGGTGCATATAGCTTGGTTGGGTTGCCTCTCCGCCTTCTCTGACTGGCGAGTGTACGGTAGGAACGCCGGTTCAATTGCATGCTCTGACCTTCTCAGGTAGAATTTCCAGACGAGTTGACAGACTCATCGTTACGCGGGCGGCGGTTCCAAAGCTCCTTACTAGAGATAGACAAGCGCCTAAATGGTTGCTTCCCGAGACGTTCATTAGCTAATGAACGTCTCGGGAAGCAACCATCATATCGATCCCGTGAATCCCTGCCCGTATGCCCCACAGGATAAGGATACACCAGTGACTGAACCTCTGCAATAGTCAGAGATCAGGGTGCTCTTTCATAGCTAATAGCTAGGCCGCGTACTTTAAGTTGTAACACTAACTGCTATGTGGTGAGCTTGAACGCGCGAAGCTGCCCCACAAGATGAAATATGGCCTTCGGAAAGATCACATTCTTGACCTCTGGGGTGTCACTTAAAATTGGCGAAGGTCGGAAAACTCTTTCTATTGCCCGCAAGGCTAAATGGTTCCAACCCCGATGTGTATTTCTCAAACTTTTCAGGTTTTTCTGAGTTACGAACAAGGGCTCGAGCGTGGGAATAGTTTAAATGAACTGTAGATTGAAGTATCGCAAGGAGGAAGTATTCTCTATCAGACGCTTGGTCACG'

mrna1, mrna2 = rev_comp(dna)
#print(mrna1, mrna2)

# Deduplicate on the protein string rather than on the RNA string,
# because the task asks for distinct proteins and ORFs within one strand can repeat.
proteins = []
for rna in orf(mrna1) + orf(mrna2):
    protein = translate(rna)
    if protein not in proteins:
        proteins.append(protein)
#print(proteins)

for protein in proteins:
    print(protein)
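The two character-by-character loops in `rev_comp` can also be written with slicing and `str.translate`. The sketch below is one possible shorter variant, not the version I submitted; `rev_comp_compact` is a hypothetical name. One difference to be aware of: the original maps every character other than A, T (or U), and C to `C`, while this version leaves unexpected characters unchanged.

```python
def rev_comp_compact(fa):
    """Return the same pair of RNA strings as rev_comp for a clean A/C/G/T input."""
    # Reverse the DNA, then complement it directly into RNA letters.
    fa_transcript = fa[::-1].translate(str.maketrans("ATCG", "UAGC"))
    # Reverse-complement the RNA string to get the other strand.
    fa_rc = fa_transcript[::-1].translate(str.maketrans("AUCG", "UAGC"))
    return fa_transcript, fa_rc


print(rev_comp_compact("ATGC"))
```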
Input sequence
TATACATCACTCCAGGCATCAGAAAATCATGAGAAAGTCTGTGCGCGTAGCGAGAAGGTAGGCTCATTTGTTACCCTTGGACAACTACTGCCGCGTCTGGGCCTCCAAATCGGCTGGTCTTTTTCAGCTCCGTCTTAGGTATCGCGAAATGGACGGGAGGACCATAACTTACCTCCTCTTCTTTTGGCAGTCAGGCTATGACCACGTTTTGTCGGTTACAGATCACCTACCGCGGCGTAACACTGGTGCATATAGCTTGGTTGGGTTGCCTCTCCGCCTTCTCTGACTGGCGAGTGTACGGTAGGAACGCCGGTTCAATTGCATGCTCTGACCTTCTCAGGTAGAATTTCCAGACGAGTTGACAGACTCATCGTTACGCGGGCGGCGGTTCCAAAGCTCCTTACTAGAGATAGACAAGCGCCTAAATGGTTGCTTCCCGAGACGTTCATTAGCTAATGAACGTCTCGGGAAGCAACCATCATATCGATCCCGTGAATCCCTGCCCGTATGCCCCACAGGATAAGGATACACCAGTGACTGAACCTCTGCAATAGTCAGAGATCAGGGTGCTCTTTCATAGCTAATAGCTAGGCCGCGTACTTTAAGTTGTAACACTAACTGCTATGTGGTGAGCTTGAACGCGCGAAGCTGCCCCACAAGATGAAATATGGCCTTCGGAAAGATCACATTCTTGACCTCTGGGGTGTCACTTAAAATTGGCGAAGGTCGGAAAACTCTTTCTATTGCCCGCAAGGCTAAATGGTTCCAACCCCGATGTGTATTTCTCAAACTTTTCAGGTTTTTCTGAGTTACGAACAAGGGCTCGAGCGTGGGAATAGTTTAAATGAACTGTAGATTGAAGTATCGCAAGGAGGAAGTATTCTCTATCAGACGCTTGGTCACG
My output
M
MKEHPDL
MMVASRDVH
MVASRDVH
MNVSGSNHLGACLSLVRSFGTAARVTMSLSTRLEILPEKVRACN
MSLSTRLEILPEKVRACN
MQLNRRSYRTLASQRRRRGNPTKLYAPVLRRGR
MHQCYAAVGDL
MVLPSISRYLRRS
MIF
MRKSVRVARR
MDGRTITYLLFFWQSGYDHVLSVTDHLPRRNTGAYSLVGLPLRLL
MTTFCRLQITYRGVTLVHIAWLGCLSAFSDWRVYGRNAGSIACSDLLR
ML
MNVSGSNHHIDPVNPCPYAPQDKDTPVTEPLQ
MPHRIRIHQ
MW
MKYGLRKDHILDLWGVT
MAFGKITFLTSGVSLKIGEGRKTLSIARKAKWFQPRCVFLKLFRFF
MVPTPMCISQTFQVFLSYEQGLERGNSLNEL
MCISQTFQVFLSYEQGLERGNSLNEL