Perl Assignment

Bioinformatic analysis with Perl, BI118G, autumn 2021, assignment 3

Assignment 3:

File I/O and regular expressions

Re-examination

Instructions

This assignment consists of 5 tasks related to the lecture, compendium chapter and exercises of the third course module. For each task, you should write a Perl program according to the description in the question.

For each task you will get 0-4 points, as follows:

- 0 points if the program is missing, or if it is completely wrong.

- 1-2 points if the program is a reasonable attempt that basically (or almost) solves the task, but with some minor errors or using badly chosen programming constructs.

- 3 points if the program solves the task correctly, contains no errors and is written using appropriate programming constructs.

- 4 points if your program fulfills the criteria for 3 points and, in addition, contains informative comments in the code that explain each statement that manipulates a file or uses a regular expression.

The maximum total score for the assignment is 20 and you get a grade according to this scale:

 F: 0-10, E: 11-12, D: 13-14, C: 15-16, B: 17-18, A: 19-20

The assignment is solely individual work since it is part of the examination for the course. It is not allowed to cooperate with other students or copy code from other sources when you answer the questions. Any undue actions will be reported to the Disciplinary Committee.

If you fail the re-examination, you will need to wait until the next time the course is given to get a new chance.

Your programs must be uploaded in the assignments section of the course site. You need to make a compressed archive of the 5 Perl program files (where each file has the suffix .pl) and upload the zip-file as your submission.

If anything is unclear, then send me an email at bjorn.olsson@his.se (in Swedish or English).

---------------------------------------------------------------------------------------------------------------------------

1. Write a program (task1.pl) that analyzes the content of a text file containing an entry from the UniProt protein sequence database. You can find example UniProt entries on the course site. For this task you can "hard code" the file name, meaning that the file name is included in the code, rather than being given as input from the user. The example below shows how it should work with the example entry COAA_ECO7I, but your program should of course be general enough to work with any UniProt entry after changing only the name of the input file in the code. The program should print out the following information to the command prompt:

- The accession number found on the line tagged with "AC"

- The source organism for the sequence, which is found in the first of the lines tagged with "OS"

- How many database references the entry contains, i.e. how many lines are tagged with the keyword "DR"

(In the example run below, the command on the first line is typed by the user, and the remaining lines are printed by the program. The original handout marks this distinction with red and black font.)

> perl task1.pl

> Accession number: B7NR61

> Source organism: Escherichia coli O7:K1 (strain IAI39 / ExPEC).

> 28 database references
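
The tag-based extraction described in task 1 can be illustrated with a minimal sketch. It is not a model answer: the helper name summarize_entry is hypothetical, the entry is a shortened toy version held in memory instead of a hard-coded file, and the two DR lines are placeholders rather than real database references.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the assignment text): scan a UniProt
# entry through an open filehandle and return the accession number, the
# source organism and the number of DR lines.
sub summarize_entry {
    my ($fh) = @_;
    my ($accession, $organism);
    my $dr_count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        # First accession on the first AC line, e.g. "AC   B7NR61;"
        if (!defined $accession && $line =~ /^AC\s+(\w+);/) {
            $accession = $1;
        }
        # Only the first OS line is used; keep everything after the tag
        elsif (!defined $organism && $line =~ /^OS\s+(.+)$/) {
            $organism = $1;
        }
        # Every line tagged DR counts as one database reference
        elsif ($line =~ /^DR\s/) {
            $dr_count++;
        }
    }
    return ($accession, $organism, $dr_count);
}

# Toy entry (assumption: shortened, with placeholder DR lines).
my $sample = <<'END';
ID   COAA_ECO7I              Reviewed;         316 AA.
AC   B7NR61;
OS   Escherichia coli O7:K1 (strain IAI39 / ExPEC).
DR   PLACEHOLDER; db-ref-1.
DR   PLACEHOLDER; db-ref-2.
//
END

# Opening a reference to a scalar gives an in-memory filehandle; a real
# program would open the hard-coded file name here instead.
open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
my ($acc, $org, $n_dr) = summarize_entry($fh);
close($fh);

print "Accession number: $acc\n";
print "Source organism: $org\n";
print "$n_dr database references\n";
```

Keeping the parsing in a sub that takes a filehandle means the same code works for a real file and for the in-memory sample used here.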

2. Write an extended version of task1.pl (call it task2.pl) that asks the user which files to analyze. The program should then open the appropriate files and print the information based on those files. An error message should be printed if one of the files cannot be found. The program should take the number of files to be analyzed as an input parameter. Some UniProt files can be found on the course site, but you can also download additional files from the UniProt site (uniprot.org).

> perl task2.pl 2

> Type file name for sequence 1: COAA_ECO7I.txt

> Accession number: B7NR61

> Source organism: Escherichia coli O7:K1 (strain IAI39 / ExPEC).

> 28 database references

> Type file name for sequence 2: ARAT_999.fasta

> UniProt file not found. Terminating..

>
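
The prompt-and-open loop from the example run can be sketched as follows. This is only an illustration under stated assumptions: the helper name prompt_and_open is hypothetical, the "keyboard" and "screen" are in-memory filehandles so the sketch runs without user interaction, and the file ARAT_999.fasta is assumed not to exist in the current directory.

```perl
use strict;
use warnings;

# Hypothetical helper: ask for $count file names on $in and try to open
# each one. Returns the names that could be opened and stops at the first
# failure, mirroring the "Terminating.." message in the example run.
sub prompt_and_open {
    my ($count, $in, $out) = @_;
    my @opened;
    for my $i (1 .. $count) {
        print {$out} "Type file name for sequence $i: ";
        my $name = <$in>;
        last unless defined $name;    # input ended early
        chomp $name;
        # Three-argument open returns false if the file is missing or
        # cannot be read.
        my $fh;
        unless (open($fh, '<', $name)) {
            print {$out} "UniProt file not found. Terminating..\n";
            last;
        }
        push @opened, $name;
        # ... the analysis from task 1 would read from $fh here ...
        close($fh);
    }
    return @opened;
}

# Demo: simulate a user typing one file name that does not exist.
my $typed = "ARAT_999.fasta\n";
open(my $fake_stdin, '<', \$typed) or die "Cannot open in-memory input: $!";
my $screen = '';
open(my $fake_stdout, '>', \$screen) or die "Cannot open in-memory output: $!";

my @ok = prompt_and_open(1, $fake_stdin, $fake_stdout);
close($fake_stdout);
print $screen;
```

In a real program the call would presumably be prompt_and_open($ARGV[0], \*STDIN, \*STDOUT), after checking that the first command-line argument is a positive integer.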

3. Write a program (task3.pl) that creates a PubMed bibliography file for a set of UniProt entries. The program should read the names of UniProt entries from an input file called entryNames.txt. For each UniProt entry, the PubMed references can be found on lines tagged with "RX". For example, in the file COAA_ECO7I.txt you can find the PubMed reference number 19165319 in the RX line. In CAPSD_LSV.txt there are two RX lines, with PubMed numbers 2129538 and 15828680, and in AL9A1_PONAB.txt there are none.

The example below shows the user dialog, the contents of entryNames.txt and the resulting contents of the output file bibliography.txt.

> perl task3.pl

> 3 entries found in entryNames.txt

> 3 PubMed reference numbers written to bibliography.txt

Example contents of entryNames.txt:

COAA_ECO7I

CAPSD_LSV

AL9A1_PONAB

Example contents of bibliography.txt after running task3.pl:

19165319

2129538

15828680
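
Extracting the PubMed numbers from RX lines can be sketched as below. The sketch assumes that RX lines carry the number in the form "PubMed=<digits>;", which is how UniProt flat files write it; the helper name pubmed_ids is hypothetical, and both the entry and the output "file" are in-memory stand-ins. Reading entryNames.txt and looping over the entries is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: return all PubMed numbers found on RX lines of the
# entry that $fh reads from.
sub pubmed_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        # Only lines tagged RX are considered; capture the digits that
        # follow "PubMed=".
        if ($line =~ /^RX\s+.*?PubMed=(\d+)/) {
            push @ids, $1;
        }
    }
    return @ids;
}

# Toy entry (assumption: reduced to the two RX lines of CAPSD_LSV).
my $entry = <<'END';
ID   CAPSD_LSV               Reviewed;         291 AA.
RX   PubMed=2129538;
RX   PubMed=15828680;
//
END

open(my $in, '<', \$entry) or die "Cannot open in-memory entry: $!";
my @ids = pubmed_ids($in);
close($in);

# A real program would open bibliography.txt with mode '>' instead of
# writing to a scalar.
my $bib = '';
open(my $out, '>', \$bib) or die "Cannot open in-memory output: $!";
print {$out} "$_\n" for @ids;
close($out);

print scalar(@ids), " PubMed reference numbers found\n";
print $bib;
```

An entry without RX lines, such as AL9A1_PONAB in the example, simply yields an empty list and contributes nothing to the output.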

4. Write a program (task4.pl) that can be used to generate a protein sequence file in Fasta format from a UniProt text file. The Fasta header line should show the accession number and sequence length. The lines following the header line should show the amino acid sequence without any blank spaces. For example, the relevant parts of CAPSD_LSV.txt look like this (the original handout uses red font to indicate the parts that are needed for the Fasta file):

ID CAPSD_LSV Reviewed; 291 AA.

AC P27335;

DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot.

DT 01-AUG-1992, sequence version 1.

DT 18-SEP-2019, entry version 66.

DE RecName: Full=Capsid protein;

DE AltName: Full=Coat protein;

DE Short=CP;

..

..

SQ SEQUENCE 291 AA; 32041 MW; 57E289F3EA726388 CRC64;

 MQSRPAQESG SASETPARGR PTPSDAPRDE PTNYNNNAES LLEQRLTRLI EKLNAEKHNS

 NLRNVAFEIG RPSLEPTSAM RRNPANPYGR FSIDELFKMK VGVVSNNMAT TEQMAKIASD

 IAGLGVPTEH VASVILQMVI MCACVSSSAF LDPEGSIEFE NGAVPVDSIA AIMKKHAGLR

 KVCRLYAPIV WNSMLVRNQP PADWQAMGFQ YNTRFAAFDT FDYVTNQAAI QPVEGIIRRP

 TSAEVIAHNA HKQLALDRSN RNERLGSLET EYTGGVQGAE IVRNHRYANN G

With the above UniProt text file as input, your program should generate a Fasta file with the contents shown below. The file name should be the same as the accession number.

> P27335, 291 AA

 MQSRPAQESGSASETPARGRPTPSDAPRDEPTNYNNNAESLLEQRLTRLIEKLNAEKHNS

 NLRNVAFEIGRPSLEPTSAMRRNPANPYGRFSIDELFKMKVGVVSNNMATTEQMAKIASD

 IAGLGVPTEHVASVILQMVIMCACVSSSAFLDPEGSIEFENGAVPVDSIAAIMKKHAGLR

 KVCRLYAPIVWNSMLVRNQPPADWQAMGFQYNTRFAAFDTFDYVTNQAAIQPVEGIIRRP

 TSAEVIAHNAHKQLALDRSNRNERLGSLETEYTGGVQGAEIVRNHRYANNG

The UniProt file name should be given by the user as an input argument when starting the program. An example run would look like this:

> perl task4.pl CAPSD_LSV.txt

> Sequence printed in Fasta format to file P27335.fasta
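
The conversion step can be sketched as follows, using the CAPSD_LSV excerpt shown above as in-memory input. The helper name uniprot_to_fasta is hypothetical, and the sketch assumes that the sequence lines are exactly the lines between the SQ line and the terminating "//" line of a UniProt entry.

```perl
use strict;
use warnings;

# Hypothetical helper: read a UniProt entry from $fh and return the
# accession number plus the Fasta text (header and space-free sequence).
sub uniprot_to_fasta {
    my ($fh) = @_;
    my ($accession, $length);
    my @seq_lines;
    my $in_sequence = 0;
    while (my $line = <$fh>) {
        chomp $line;
        if (!defined $accession && $line =~ /^AC\s+(\w+);/) {
            $accession = $1;                # first accession number
        }
        elsif ($line =~ /^SQ\s+SEQUENCE\s+(\d+)\s+AA;/) {
            $length = $1;                   # sequence length from SQ line
            $in_sequence = 1;               # sequence lines follow
        }
        elsif ($line =~ m{^//}) {
            $in_sequence = 0;               # end of entry
        }
        elsif ($in_sequence) {
            # Strip all whitespace from a sequence line
            (my $residues = $line) =~ s/\s+//g;
            push @seq_lines, $residues if length $residues;
        }
    }
    return unless defined $accession && defined $length;
    my $fasta = "> $accession, $length AA\n" . join("\n", @seq_lines) . "\n";
    return ($accession, $fasta);
}

my $entry = <<'END';
ID   CAPSD_LSV               Reviewed;         291 AA.
AC   P27335;
SQ   SEQUENCE   291 AA;  32041 MW;  57E289F3EA726388 CRC64;
     MQSRPAQESG SASETPARGR PTPSDAPRDE PTNYNNNAES LLEQRLTRLI EKLNAEKHNS
     NLRNVAFEIG RPSLEPTSAM RRNPANPYGR FSIDELFKMK VGVVSNNMAT TEQMAKIASD
     IAGLGVPTEH VASVILQMVI MCACVSSSAF LDPEGSIEFE NGAVPVDSIA AIMKKHAGLR
     KVCRLYAPIV WNSMLVRNQP PADWQAMGFQ YNTRFAAFDT FDYVTNQAAI QPVEGIIRRP
     TSAEVIAHNA HKQLALDRSN RNERLGSLET EYTGGVQGAE IVRNHRYANN G
//
END

open(my $fh, '<', \$entry) or die "Cannot open in-memory entry: $!";
my ($acc, $fasta) = uniprot_to_fasta($fh);
close($fh);

print $fasta;
```

A complete program would take the input file name from @ARGV and write $fasta to a file named "$acc.fasta", opened with mode '>', as in the example run.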

5. Write a program (task5.pl) that takes a protein identifier as input and uses regular expressions to check the following information in the corresponding UniProt text file:

a. Was the sequence integrated into UniProt in the 1990s?

This would be true for CAPSD_LSV based on the first DT line:

DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot

b. Is the protein from a primate?

This would be true for AL9A1_PONAB based on one of the OC lines:

OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;

c. Is the sequence length a three-digit number?

This would be true for COAA_ECO7I based on the SQ line:

SQ SEQUENCE 316 AA; 36360 MW; DDDC1922C5C52A70 CRC64;

Example run:

> perl task5.pl CAPSD_LSV

> Yes, integrated into UniProt in the 1990s

> No, is not from a primate

> Yes, sequence length between 100 and 999
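
The three checks can each be expressed as one regular expression, as in the sketch below. The helper name check_entry and the hash keys are hypothetical, and the sample is reduced to the DT and SQ lines of CAPSD_LSV quoted in task 4; its OC lines are not shown in this handout, so they are left out of the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: run the three checks on the entry read from $fh
# and return a hash of true/false flags.
sub check_entry {
    my ($fh) = @_;
    my %result = (nineties => 0, primate => 0, three_digit => 0);
    while (my $line = <$fh>) {
        # (a) DT line with a year from 1990 to 1999, followed by the
        #     word "integrated"
        $result{nineties} = 1
            if $line =~ /^DT\s+\d{2}-[A-Z]{3}-199\d,\s+integrated\b/;
        # (b) any OC line that lists the taxon "Primates"
        $result{primate} = 1
            if $line =~ /^OC\s.*\bPrimates\b/;
        # (c) SQ line whose length field consists of exactly three digits
        $result{three_digit} = 1
            if $line =~ /^SQ\s+SEQUENCE\s+\d{3}\s+AA;/;
    }
    return %result;
}

# Sample reduced to lines quoted in this handout (OC lines omitted).
my $sample = <<'END';
DT   01-AUG-1992, integrated into UniProtKB/Swiss-Prot.
DT   01-AUG-1992, sequence version 1.
SQ   SEQUENCE   291 AA;  32041 MW;  57E289F3EA726388 CRC64;
END

open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
my %r = check_entry($fh);
close($fh);

for my $check (sort keys %r) {
    printf "%-12s %s\n", $check, $r{$check} ? 'Yes' : 'No';
}
```

In check (c), the whitespace required on both sides of \d{3} is what rules out two-digit and four-digit lengths.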

---------------------------------------------------------------------------------------

End of assignment. 
