Perl Assignment

INSTRUCTIONS TO CANDIDATES

ANSWER ALL QUESTIONS

Bioinformatic analysis with Perl, BI118G, autumn 2021, assignment 3

Assignment 3:

File I/O and regular expressions

Re-examination

Instructions

This assignment consists of 5 tasks related to the lecture, compendium chapter and exercises

of the third course module. For each task, you should write a Perl program according to the

description in the question.

For each task you will get 0-4 points, as follows:

- 0 points if the program is missing, or if it is completely wrong.

- 1-2 points if the program is a reasonable attempt that basically (or almost) solves the task,

but with some minor errors or using badly chosen programming constructs.

- 3 points if the program solves the task correctly, contains no errors and is written using

appropriate programming constructs.

- 4 points if your program fulfills the criteria for 3 points and, in addition, contains

informative comments in the code that explain each statement that manipulates a file or

uses a regular expression.

The maximum total score for the assignment is 20 and you get a grade according to this scale:

- F: 0-10, E: 11-12, D: 13-14, C: 15-16, B: 17-18, A: 19-20

The assignment is strictly individual work, since it is part of the examination for the course. You

are not allowed to cooperate with other students or copy code from other sources when you

answer the questions. Any misconduct will be reported to the Disciplinary Committee.

If you fail the re-examination, you will need to wait until the next time the course

is given to get a new chance.

Your programs must be uploaded in the assignments section of the course site. You need to

make a compressed archive of the 5 Perl program files (where each file has the suffix .pl) and

upload the zip-file as your submission.

If anything is unclear, then send me an email at bjorn.olsson@his.se (in Swedish or English).

---------------------------------------------------------------------------------------------------------------------------

1. Write a program (task1.pl) that analyzes the content of a text file containing an entry from

the UniProt protein sequence database. You can find example UniProt entries on the

course site. For this task you can "hard code" the file name, meaning that the file name is

included in the code, rather than being given as input from the user. The example below

shows how it should work with the example entry COAA_ECO7I, but your program should

of course be general enough to work with any UniProt entry after changing only the name

of the input file in the code. The program should print out the following information to

the command prompt:

- The accession number found on the line tagged with "AC"

- The source organism for the sequence, which is found in the first of the lines tagged

with "OS"

- How many database references the entry contains, i.e. how many lines are tagged

with the keyword "DR"

(In the example run below, everything in red font is typed by the user, and everything in

black font is printed by the program.)

> perl task1.pl

> Accession number: B7NR61

> Source organism: Escherichia coli O7:K1 (strain IAI39 / ExPEC).

> 28 database references
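
The tag-based extraction described in task 1 can be sketched as follows. This is only an illustration under stated assumptions, not a model answer: the subroutine name summarize_entry is hypothetical, the entry is held in memory instead of being read from a file, and the two DR lines are made-up placeholders (a real entry such as COAA_ECO7I has 28).

```perl
use strict;
use warnings;

# Hypothetical helper: summarize one UniProt entry given its lines.
# Returns (accession number, source organism, number of DR lines).
sub summarize_entry {
    my @lines = @_;
    my ($accession, $organism);
    my $dr_count = 0;
    for my $line (@lines) {
        # AC line: capture the first accession, i.e. the text before the first ';'.
        if (!defined $accession && $line =~ /^AC\s+([^;\s]+);/) {
            $accession = $1;
        }
        # OS line: keep only the first one and drop trailing whitespace.
        elsif (!defined $organism && $line =~ /^OS\s+(.+?)\s*$/) {
            $organism = $1;
        }
        # DR line: count every line that starts with the DR tag.
        elsif ($line =~ /^DR\s/) {
            $dr_count++;
        }
    }
    return ($accession, $organism, $dr_count);
}

# Abbreviated in-memory sample. The AC and OS values come from the example
# run above; the DR lines are placeholders, not real database references.
my @sample = (
    "AC   B7NR61;",
    "OS   Escherichia coli O7:K1 (strain IAI39 / ExPEC).",
    "DR   PLACEHOLDER; ref-1.",
    "DR   PLACEHOLDER; ref-2.",
);

my ($ac, $os, $dr) = summarize_entry(@sample);
print "Accession number: $ac\n";
print "Source organism: $os\n";
print "$dr database references\n";
```

In a file-based version, the lines would instead come from a file handle opened with the three-argument form of open, for example open(my $fh, '<', $file_name), and be read with <$fh>.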

2. Write an extended version of task1.pl (call it task2.pl) that asks the user which files to

analyze. The program should then open the appropriate files and print the information

based on those files. An error message should be printed if one of the files cannot be found.

The program should take as an input parameter the number of files that will be analyzed.

Some UniProt files can be found on the course site, but you can also download additional

files from the UniProt site (uniprot.org).

> perl task2.pl 2

> Type file name for sequence 1: COAA_ECO7I.txt

> Accession number: B7NR61

> Source organism: Escherichia coli O7:K1 (strain IAI39 / ExPEC).

> 28 database references

> Type file name for sequence 2: ARAT_999.fasta

> UniProt file not found. Terminating..
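
One possible way to structure the dialog in task 2 is sketched below. It is a hedged illustration only: the helper name read_lines is hypothetical, a missing or non-numeric argument is treated as 0 files (an assumption, since the task does not say what should happen), and the per-file analysis is replaced by a simple line count.

```perl
use strict;
use warnings;

# Hypothetical helper: read all lines of a file.
# Returns an array reference, or undef if the file cannot be opened.
sub read_lines {
    my ($file_name) = @_;
    # Three-argument open in read mode; give up quietly if it fails.
    open(my $fh, '<', $file_name) or return;
    my @lines = <$fh>;
    close($fh);
    return \@lines;
}

# The number of files is the first command-line argument.
# Assumption: anything that is not a whole number counts as 0.
my $count = (@ARGV && $ARGV[0] =~ /^\d+$/) ? $ARGV[0] : 0;

for my $i (1 .. $count) {
    print "Type file name for sequence ${i}: ";
    my $file_name = <STDIN>;
    last unless defined $file_name;   # stop if the input ends early
    chomp $file_name;

    my $lines = read_lines($file_name);
    if (!defined $lines) {
        print "UniProt file not found. Terminating..\n";
        exit;
    }
    # Placeholder for the task 1 analysis of the lines that were read.
    print scalar(@$lines), " lines read from $file_name\n";
}
```

Returning undef from the helper instead of calling die keeps the decision about the error message in the main loop, which is where the wording required by the task is printed.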

3. Write a program (task3.pl) that creates a PubMed bibliography file for a set of UniProt

entries. The program should read the names of UniProt entries from an input file called

entryNames.txt. For each UniProt entry, the PubMed references can be found on lines

tagged with "RX". For example, in the file COAA_ECO7I.txt you can find the PubMed

reference number 19165319 in the RX line. In CAPSD_LSV.txt there are two RX lines, with

PubMed numbers 2129538 and 15828680, and in AL9A1_PONAB.txt there are none.

The example below shows the user dialog, the contents of entryNames.txt and the

resulting contents of the output file bibliography.txt.

> perl task3.pl

> 3 entries found in entryNames.txt

> 3 PubMed reference numbers written to bibliography.txt

Example contents of entryNames.txt:

COAA_ECO7I

CAPSD_LSV

AL9A1_PONAB

Example contents of bibliography.txt after running task3.pl:

19165319

2129538

15828680
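
The extraction step of task 3 might look like the sketch below. It assumes that each RX line carries the number in the form PubMed=<digits>; that layout is not shown in this text, so treat it as an assumption. The subroutine name pubmed_ids is hypothetical, and the entries are held in memory here instead of being read from entryNames.txt and the corresponding entry files.

```perl
use strict;
use warnings;

# Hypothetical helper: collect PubMed numbers from the RX lines of one entry.
sub pubmed_ids {
    my @lines = @_;
    my @ids;
    for my $line (@lines) {
        # Match lines tagged RX and capture the digits after "PubMed=".
        if ($line =~ /^RX\s.*?PubMed=(\d+)/) {
            push @ids, $1;
        }
    }
    return @ids;
}

# In-memory stand-ins for the three entries named in the task. Only the
# PubMed numbers are taken from the text; the line layout is assumed.
my @entries = (
    [ 'COAA_ECO7I',  [ "RX   PubMed=19165319;" ] ],
    [ 'CAPSD_LSV',   [ "RX   PubMed=2129538;", "RX   PubMed=15828680;" ] ],
    [ 'AL9A1_PONAB', [] ],   # this entry has no RX lines
);

my @all_ids;
for my $entry (@entries) {
    my ($name, $lines) = @$entry;
    push @all_ids, pubmed_ids(@$lines);
}

print scalar(@entries), " entries found\n";
print scalar(@all_ids), " PubMed reference numbers collected\n";
print "$_\n" for @all_ids;
```

A file-based version would print the numbers to a handle opened for writing, for example open(my $out, '>', 'bibliography.txt'), instead of to the command prompt.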

4. Write a program (task4.pl) that can be used to generate a protein sequence file in Fasta

format from a UniProt text file. The Fasta header line should show the accession number

and sequence length. The lines following after the header line should show the amino

acid sequence without any blank spaces. For example, the relevant parts of

CAPSD_LSV.txt look like this (with red font indicating the parts that we need for the Fasta

file):

ID CAPSD_LSV Reviewed; 291 AA.

AC P27335;

DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot.

DT 01-AUG-1992, sequence version 1.

DT 18-SEP-2019, entry version 66.

DE RecName: Full=Capsid protein;

DE AltName: Full=Coat protein;

DE Short=CP;

SQ SEQUENCE 291 AA; 32041 MW; 57E289F3EA726388 CRC64;

MQSRPAQESG SASETPARGR PTPSDAPRDE PTNYNNNAES LLEQRLTRLI EKLNAEKHNS

NLRNVAFEIG RPSLEPTSAM RRNPANPYGR FSIDELFKMK VGVVSNNMAT TEQMAKIASD

IAGLGVPTEH VASVILQMVI MCACVSSSAF LDPEGSIEFE NGAVPVDSIA AIMKKHAGLR

KVCRLYAPIV WNSMLVRNQP PADWQAMGFQ YNTRFAAFDT FDYVTNQAAI QPVEGIIRRP

TSAEVIAHNA HKQLALDRSN RNERLGSLET EYTGGVQGAE IVRNHRYANN G

With the above UniProt text file as input, your program should generate a Fasta file with

the contents shown below. The file name should be the same as the accession number.

> P27335, 291 AA

MQSRPAQESGSASETPARGRPTPSDAPRDEPTNYNNNAESLLEQRLTRLIEKLNAEKHNS

NLRNVAFEIGRPSLEPTSAMRRNPANPYGRFSIDELFKMKVGVVSNNMATTEQMAKIASD

IAGLGVPTEHVASVILQMVIMCACVSSSAFLDPEGSIEFENGAVPVDSIAAIMKKHAGLR

KVCRLYAPIVWNSMLVRNQPPADWQAMGFQYNTRFAAFDTFDYVTNQAAIQPVEGIIRRP

TSAEVIAHNAHKQLALDRSNRNERLGSLETEYTGGVQGAEIVRNHRYANNG

The UniProt file name should be given by the user as input argument when starting the

program. An example run would look like this:

> perl task4.pl CAPSD_LSV.txt

> Sequence printed in Fasta format to file P27335.fasta
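
The conversion in task 4 can be sketched like this, using the CAPSD_LSV excerpt above as in-memory input. The subroutine name uniprot_to_fasta is hypothetical, and the sketch assumes that the sequence block ends at a line starting with "//" or at the end of the input. The header is written with a space after ">" to match the example output above.

```perl
use strict;
use warnings;

# Hypothetical helper: build Fasta text from the lines of one UniProt entry.
# Returns (accession, fasta_text), or an empty list if AC or SQ is missing.
sub uniprot_to_fasta {
    my @lines = @_;
    my ($accession, $length);
    my @seq_lines;
    my $in_sequence = 0;
    for my $line (@lines) {
        if ($line =~ m{^//}) {
            # Assumed end-of-entry marker: stop collecting sequence lines.
            $in_sequence = 0;
        }
        elsif ($in_sequence) {
            # Remove all whitespace so only the residues remain.
            (my $residues = $line) =~ s/\s+//g;
            push @seq_lines, $residues if length $residues;
        }
        elsif (!defined $accession && $line =~ /^AC\s+([^;\s]+);/) {
            # AC line: capture the first accession number.
            $accession = $1;
        }
        elsif ($line =~ /^SQ\s+SEQUENCE\s+(\d+)\s+AA;/) {
            # SQ line: capture the length; the sequence starts on the next line.
            $length = $1;
            $in_sequence = 1;
        }
    }
    return unless defined $accession && defined $length;
    return ($accession, join("\n", "> $accession, $length AA", @seq_lines) . "\n");
}

my @sample = (
    "AC P27335;",
    "SQ SEQUENCE 291 AA; 32041 MW; 57E289F3EA726388 CRC64;",
    "MQSRPAQESG SASETPARGR PTPSDAPRDE PTNYNNNAES LLEQRLTRLI EKLNAEKHNS",
    "NLRNVAFEIG RPSLEPTSAM RRNPANPYGR FSIDELFKMK VGVVSNNMAT TEQMAKIASD",
    "IAGLGVPTEH VASVILQMVI MCACVSSSAF LDPEGSIEFE NGAVPVDSIA AIMKKHAGLR",
    "KVCRLYAPIV WNSMLVRNQP PADWQAMGFQ YNTRFAAFDT FDYVTNQAAI QPVEGIIRRP",
    "TSAEVIAHNA HKQLALDRSN RNERLGSLET EYTGGVQGAE IVRNHRYANN G",
    "//",
);

my ($accession, $fasta) = uniprot_to_fasta(@sample);
print $fasta;
```

Because each sequence line of the entry already holds 60 residues, stripping the whitespace line by line preserves the line breaks shown in the expected Fasta file. Writing the result to a file named after the accession number would use a handle opened with open(my $out, '>', "$accession.fasta").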

5. Write a program (task5.pl) that takes a protein identifier as input and uses regular

expressions to check the following information in the corresponding UniProt text file:

a. Was the sequence integrated into UniProt in the 1990s?

This would be true for CAPSD_LSV based on the first DT line:

DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot

b. Is the protein from a primate?

This would be true for AL9A1_PONAB based on one of the OC lines:

OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;

c. Is the sequence length a three-digit number?

This would be true for COAA_ECO7I based on the SQ line:

SQ SEQUENCE 316 AA; 36360 MW; DDDC1922C5C52A70 CRC64;

Example run:

> perl task5.pl CAPSD_LSV

> Yes, integrated into UniProt in the 1990s

> No, is not from a primate

> Yes, sequence length between 100 and 999
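
The three checks in task 5 could be expressed with regular expressions roughly as in the sketch below. The subroutine name check_entry and the hash keys are hypothetical, and the wording of the messages that do not appear in the example run is invented for illustration. The sample lines are the DT and SQ lines of CAPSD_LSV quoted earlier in the assignment; no OC line is included.

```perl
use strict;
use warnings;

# Hypothetical helper: run the three checks on the lines of one entry.
# Returns a hash with the keys nineties, primate and three_digits (0 or 1).
sub check_entry {
    my @lines = @_;
    my %result = (nineties => 0, primate => 0, three_digits => 0);
    for my $line (@lines) {
        # DT line with a year 1990-1999 followed by "integrated into".
        $result{nineties} = 1
            if $line =~ /^DT\s+\d{2}-[A-Z]{3}-199\d, integrated into/;
        # OC line that lists Primates as one of the taxonomic groups.
        $result{primate} = 1
            if $line =~ /^OC\s.*\bPrimates\b/;
        # SQ line whose length field consists of exactly three digits.
        $result{three_digits} = 1
            if $line =~ /^SQ\s+SEQUENCE\s+\d{3}\s+AA;/;
    }
    return %result;
}

my @sample = (
    "DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot.",
    "DT 01-AUG-1992, sequence version 1.",
    "SQ SEQUENCE 291 AA; 32041 MW; 57E289F3EA726388 CRC64;",
);

my %r = check_entry(@sample);

my $msg_a = $r{nineties}     ? "Yes, integrated into UniProt in the 1990s"
                             : "No, not integrated into UniProt in the 1990s";
my $msg_b = $r{primate}      ? "Yes, is from a primate"
                             : "No, is not from a primate";
my $msg_c = $r{three_digits} ? "Yes, sequence length between 100 and 999"
                             : "No, sequence length not between 100 and 999";
print "$msg_a\n$msg_b\n$msg_c\n";
```

Anchoring each pattern at the line tag (^DT, ^OC, ^SQ) keeps a match from being triggered by the same text appearing on an unrelated line, such as a comment line that happens to mention primates.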

---------------------------------------------------------------------------------------

End of assignment.
