Perl Assignment
Bioinformatic analysis with Perl, BI118G, autumn 2021, assignment 3
Assignment 3:
File I/O and regular expressions
Re-examination
Instructions
This assignment consists of 5 tasks related to the lecture, compendium chapter and exercises
of the third course module. For each task, you should write a Perl program according to the
description in the question.
For each task you will get 0-4 points, as follows:
0 points if the program is missing, or if it is completely wrong.
1-2 points if the program is a reasonable attempt that basically (or almost) solves the task,
but with some minor errors or using badly chosen programming constructs.
3 points if the program solves the task correctly, contains no errors and is written using
appropriate programming constructs.
4 points if your program fulfills the criteria for 3 points and, in addition, contains
informative comments in the code that explain each statement that manipulates a file or
uses a regular expression.
The maximum total score for the assignment is 20 and you get a grade according to this scale:
F: 0-10, E: 11-12, D: 13-14, C: 15-16, B: 17-18, A: 19-20
The assignment is strictly individual work, since it is part of the examination for the course.
You are not allowed to cooperate with other students or to copy code from other sources when
you answer the questions. Any suspected misconduct will be reported to the Disciplinary Committee.
If you fail the re-examination, you will need to wait until the next time the course
is given to get a new chance.
Your programs must be uploaded in the assignments section of the course site. You need to
make a compressed archive of the 5 Perl program files (where each file has the suffix .pl) and
upload the zip-file as your submission.
If anything is unclear, then send me an email at bjorn.olsson@his.se (in Swedish or English).
---------------------------------------------------------------------------------------------------------------------------
1. Write a program (task1.pl) that analyzes the content of a text file containing an entry from
the UniProt protein sequence database. You can find example UniProt entries on the
course site. For this task you can "hard code" the file name, meaning that the file name is
included in the code, rather than being given as input from the user. The example below
shows how it should work with the example entry COAA_ECO7I, but your program should
of course be general enough to work with any UniProt entry after changing only the name
of the input file in the code. The program should print out the following information to
the command prompt:
- The accession number found on the line tagged with "AC"
- The source organism for the sequence, which is found in the first of the lines tagged
with "OS"
- How many database references the entry contains, i.e. how many lines are tagged
with the keyword "DR"
(In the example run below, the line starting with "perl" is typed by the user, and
everything else is printed by the program.)
> perl task1.pl
> Accession number: B7NR61
> Source organism: Escherichia coli O7:K1 (strain IAI39 / ExPEC).
> 28 database references
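The tag matching described in task 1 can be sketched as follows. This is only a minimal illustration under stated assumptions, not a reference solution: the helper name summarize_entry is hypothetical, the entry is held in an in-memory list instead of being read from a file, and the two DR lines are made-up placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: scans the lines of one UniProt entry and collects
# the accession number, the first OS line and the number of DR lines.
sub summarize_entry {
    my @lines = @_;
    my %info = (accession => undef, organism => undef, dr_count => 0);
    for my $line (@lines) {
        # AC line: capture the first accession number, up to the semicolon
        if (!defined $info{accession} && $line =~ /^AC\s+(\w+);/) {
            $info{accession} = $1;
        }
        # OS line: keep only the first one and capture the rest of the line
        elsif (!defined $info{organism} && $line =~ /^OS\s+(.+?)\s*$/) {
            $info{organism} = $1;
        }
        # DR line: count every line that starts with the tag
        elsif ($line =~ /^DR\s/) {
            $info{dr_count}++;
        }
    }
    return %info;
}

# Illustrative excerpt; the two DR lines are made-up placeholders
my @entry = (
    "AC   B7NR61;\n",
    "OS   Escherichia coli O7:K1 (strain IAI39 / ExPEC).\n",
    "DR   PLACEHOLDER; first.\n",
    "DR   PLACEHOLDER; second.\n",
);
my %info = summarize_entry(@entry);
print "Accession number: $info{accession}\n";
print "Source organism: $info{organism}\n";
print "$info{dr_count} database references\n";
```

Anchoring each pattern with ^ means that a tag is only recognised at the start of a line, so the letters "DR" inside a description are not counted.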
2. Write an extended version of task1.pl (call it task2.pl) that asks the user which files to
analyze. The program should then open the appropriate files and print the information
based on those files. An error message should be printed if one of the files cannot be found.
The program should take as an input parameter the number of files that will be analyzed.
Some UniProt files can be found on the course site, but you can also download additional
files from the UniProt site (uniprot.org).
> perl task2.pl 2
> Type file name for sequence 1: COAA_ECO7I.txt
> Accession number: B7NR61
> Source organism: Escherichia coli O7:K1 (strain IAI39 / ExPEC).
> 28 database references
> Type file name for sequence 2: ARAT_999.fasta
> UniProt file not found. Terminating..
>
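The file handling in task 2 could look roughly like the sketch below. It is an illustration only: the helper name read_lines is hypothetical, and the loop just reports how many lines were read instead of analyzing the entry. It assumes that the number of files is given as the first command-line argument.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to the lines of the file,
# or undef if the file cannot be opened.
sub read_lines {
    my ($file_name) = @_;
    # Three-argument open in read mode; fails if the file does not exist
    open(my $fh, '<', $file_name) or return;
    my @lines = <$fh>;
    close($fh);
    return \@lines;
}

# The number of files is the first command-line argument (0 if missing)
my $count = @ARGV ? $ARGV[0] : 0;
for my $i (1 .. $count) {
    print "Type file name for sequence $i: ";
    my $file_name = <STDIN>;
    last unless defined $file_name;   # stop if the input has ended
    chomp $file_name;                 # remove the trailing newline
    my $lines = read_lines($file_name);
    if (!defined $lines) {
        print "UniProt file not found. Terminating..\n";
        exit;
    }
    print scalar(@$lines), " lines read from $file_name\n";
}
```

Checking the return value of open, rather than assuming that the file exists, is what makes it possible to print a controlled error message.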
3. Write a program (task3.pl) that creates a PubMed bibliography file for a set of UniProt
entries. The program should read the names of UniProt entries from an input file called
entryNames.txt. For each UniProt entry, the PubMed references can be found on lines
tagged with "RX". For example, in the file COAA_ECO7I.txt you can find the PubMed
reference number 19165319 in the RX line. In CAPSD_LSV.txt there are two RX lines, with
PubMed numbers 2129538 and 15828680, and in AL9A1_PONAB.txt there are none.
The example below shows the user dialog, the contents of entryNames.txt and the
resulting contents of the output file bibliography.txt.
> perl task3.pl
> 3 entries found in entryNames.txt
> 3 PubMed reference numbers written to bibliography.txt
Example contents of entryNames.txt:
COAA_ECO7I
CAPSD_LSV
AL9A1_PONAB
Example contents of bibliography.txt after running task3.pl:
19165319
2129538
15828680
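Extracting the reference numbers for task 3 might be sketched as below. The sketch assumes that an RX line carries the number in the form "PubMed=<digits>;", so check this against the actual files. The helper name pubmed_ids is hypothetical and the RX lines are simplified to show only the PubMed part.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the PubMed numbers found on RX lines.
sub pubmed_ids {
    my @lines = @_;
    my @ids;
    for my $line (@lines) {
        # Only RX lines are of interest; capture the digits after "PubMed="
        if ($line =~ /^RX\s.*?PubMed=(\d+)/) {
            push @ids, $1;
        }
    }
    return @ids;
}

# Simplified excerpt with the two PubMed numbers given for CAPSD_LSV
my @entry = (
    "AC   P27335;\n",
    "RX   PubMed=2129538;\n",
    "RX   PubMed=15828680;\n",
);
print "$_\n" for pubmed_ids(@entry);
```

An entry without RX lines simply yields an empty list, which matches the AL9A1_PONAB case where nothing is added to the bibliography.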
4. Write a program (task4.pl) that can be used to generate a protein sequence file in Fasta
format from a UniProt text file. The Fasta header line should show the accession number
and sequence length. The lines following the header line should show the amino
acid sequence without any blank spaces. For example, the relevant parts of
CAPSD_LSV.txt look like this (the parts needed for the Fasta file are the accession
number, the sequence length and the amino acid sequence):
ID CAPSD_LSV Reviewed; 291 AA.
AC P27335;
DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot.
DT 01-AUG-1992, sequence version 1.
DT 18-SEP-2019, entry version 66.
DE RecName: Full=Capsid protein;
DE AltName: Full=Coat protein;
DE Short=CP;
..
..
SQ SEQUENCE 291 AA; 32041 MW; 57E289F3EA726388 CRC64;
MQSRPAQESG SASETPARGR PTPSDAPRDE PTNYNNNAES LLEQRLTRLI EKLNAEKHNS
NLRNVAFEIG RPSLEPTSAM RRNPANPYGR FSIDELFKMK VGVVSNNMAT TEQMAKIASD
IAGLGVPTEH VASVILQMVI MCACVSSSAF LDPEGSIEFE NGAVPVDSIA AIMKKHAGLR
KVCRLYAPIV WNSMLVRNQP PADWQAMGFQ YNTRFAAFDT FDYVTNQAAI QPVEGIIRRP
TSAEVIAHNA HKQLALDRSN RNERLGSLET EYTGGVQGAE IVRNHRYANN G
With the above UniProt text file as input, your program should generate a Fasta file with
the contents shown below. The file name should be the accession number followed by the
suffix .fasta.
> P27335, 291 AA
MQSRPAQESGSASETPARGRPTPSDAPRDEPTNYNNNAESLLEQRLTRLIEKLNAEKHNS
NLRNVAFEIGRPSLEPTSAMRRNPANPYGRFSIDELFKMKVGVVSNNMATTEQMAKIASD
IAGLGVPTEHVASVILQMVIMCACVSSSAFLDPEGSIEFENGAVPVDSIAAIMKKHAGLR
KVCRLYAPIVWNSMLVRNQPPADWQAMGFQYNTRFAAFDTFDYVTNQAAIQPVEGIIRRP
TSAEVIAHNAHKQLALDRSNRNERLGSLETEYTGGVQGAEIVRNHRYANNG
The UniProt file name should be given by the user as an input argument when starting the
program. An example run would look like this:
> perl task4.pl CAPSD_LSV.txt
> Sequence printed in Fasta format to file P27335.fasta
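Two of the steps in task 4, reading the length from the SQ line and removing the blanks from a sequence line, can be sketched as follows. The helper names strip_whitespace and sequence_length are hypothetical, and the input lines are taken from the CAPSD_LSV excerpt above rather than read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: removes all whitespace from one sequence line.
sub strip_whitespace {
    my ($line) = @_;
    # Substitute every run of whitespace (blanks, newline) with nothing
    $line =~ s/\s+//g;
    return $line;
}

# Hypothetical helper: returns the length from an SQ line, or undef.
sub sequence_length {
    my ($line) = @_;
    # Capture the digits between "SEQUENCE" and "AA;"
    return $line =~ /^SQ\s+SEQUENCE\s+(\d+)\s+AA;/ ? $1 : undef;
}

my $sq_line  = "SQ SEQUENCE 291 AA; 32041 MW; 57E289F3EA726388 CRC64;\n";
my $seq_line = "TSAEVIAHNA HKQLALDRSN RNERLGSLET EYTGGVQGAE IVRNHRYANN G\n";
print "Length: ", sequence_length($sq_line), "\n";
print strip_whitespace($seq_line), "\n";
```

Writing the result to a file would additionally need an open call in write mode ('>') with a file name built from the accession number.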
5. Write a program (task5.pl) that takes a protein identifier as input and uses regular
expressions to check the following information in the corresponding UniProt text file:
a. Was the sequence integrated into UniProt in the 1990s?
This would be true for CAPSD_LSV based on the first DT line:
DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot
b. Is the protein from a primate?
This would be true for AL9A1_PONAB based on one of the OC lines:
OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
c. Is the sequence length a three-digit number?
This would be true for COAA_ECO7I based on the SQ line:
SQ SEQUENCE 316 AA; 36360 MW; DDDC1922C5C52A70 CRC64;
Example run:
> perl task5.pl CAPSD_LSV
> Yes, integrated into UniProt in the 1990s
> No, is not from a primate
> Yes, sequence length between 100 and 999
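The three checks in task 5 could be expressed with patterns along the lines of the sketch below. The helper names are hypothetical, the patterns are one possible formulation rather than the expected answer, and the test lines are the ones quoted in the task text.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a DT "integrated" line has a year 1990-1999.
sub integrated_in_1990s {
    my ($line) = @_;
    # Day, three-letter month, then a year starting with 199
    return $line =~ /^DT\s+\d{2}-[A-Z]{3}-199\d, integrated into/ ? 1 : 0;
}

# Hypothetical helper: true if an OC line lists the taxon "Primates".
sub is_primate_line {
    my ($line) = @_;
    # \b keeps the match to the whole word "Primates"
    return $line =~ /^OC\s.*\bPrimates\b/ ? 1 : 0;
}

# Hypothetical helper: true if the SQ line gives a three-digit length.
sub has_three_digit_length {
    my ($line) = @_;
    # Exactly three digits, delimited by whitespace on both sides
    return $line =~ /^SQ\s+SEQUENCE\s+\d{3}\s+AA;/ ? 1 : 0;
}

print integrated_in_1990s("DT 01-AUG-1992, integrated into UniProtKB/Swiss-Prot")
    ? "Yes" : "No", ", integrated into UniProt in the 1990s\n";
print is_primate_line("OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;")
    ? "Yes" : "No", ", is from a primate\n";
print has_three_digit_length("SQ SEQUENCE 316 AA; 36360 MW; DDDC1922C5C52A70 CRC64;")
    ? "Yes" : "No", ", sequence length between 100 and 999\n";
```

In a full program each check would be applied to every line of the file, and the answer is "Yes" if any line matches.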
---------------------------------------------------------------------------------------
End of assignment.