
Date: 2020-06-08 11:15

Programming languages for Bioinformatics

Spring 2020

Project, week 11

(All files mentioned below can be found under the directory /home/faculty/ccwei/courses/2020/plb/proj1/ on the course server.)

1. Write a program to find differences between two files containing bioinformatics data.


Biodiff [options] from-file to-file

Given two files A (from-file) and B (to-file), your program should output all lines in A-B, A&B, and B-A according to the chosen comparison criterion. The file formats of A and B may differ. There are two styles of comparison: coordinate-based (option -c) and name-based (option -n). You may make either one the default. The two styles are described below.

1) Coordinate-based diff. Two columns are selected from each of files A and B, and the numbers in these two columns define a region for each line. The regions are then compared to check whether any region from A overlaps any region from B. If two regions from the two files overlap, the corresponding lines are written to two files called A&B_A and A&B_B; lines of A that are not in A&B_A are written to A-B; and lines of B that are not in A&B_B are written to B-A. Note that the comparison is based on the coordinates in the two user-specified columns, but the output contains whole lines from the original files.
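The overlap check described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the assignment does not say whether regions are closed or half-open, so the sketch treats both endpoints as inclusive (touching regions count as overlapping), and it swaps the two numbers if the first is larger. The names Region, region_normalize, and regions_overlap are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A region built from the two user-selected columns of one input line. */
typedef struct {
    long start;
    long end;
} Region;

/* Return a copy of r with start <= end.
 * Assumption: the two columns may appear in either order. */
static Region region_normalize(Region r)
{
    if (r.start > r.end) {
        long tmp = r.start;
        r.start = r.end;
        r.end = tmp;
    }
    return r;
}

/* Closed-interval overlap test.
 * Assumption: both endpoints are inclusive, so [100,200] and [200,300]
 * are reported as overlapping. */
static bool regions_overlap(Region a, Region b)
{
    a = region_normalize(a);
    b = region_normalize(b);
    assert(a.start <= a.end && b.start <= b.end);
    return a.start <= b.end && b.start <= a.end;
}
```

If the data use half-open coordinates instead, the two comparisons would become strict (<), so this choice should be checked against the actual input files.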

For example, given the two example files A_ucsc_genes.txt and B_ucsc_gene.gtf, if you run

Biodiff -c -a 3,4 -b 3,4 A_ucsc_genes.txt B_ucsc_gene.gtf

columns 3 and 4 of A_ucsc_genes.txt are selected to represent a region, as are columns 3 and 4 of B_ucsc_gene.gtf, and the regions are then compared. The program should generate four result files, A&B_A, A&B_B, A-B, and B-A, where A&B_A contains the lines of file A that overlap some entry in file B; A&B_B contains the lines of file B that overlap some entry in file A; A-B contains the lines of file A that overlap no entry in B; and B-A contains the lines of file B that overlap no entry in A.
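Turning a line into a region requires pulling the selected columns out of each line first. The following is a minimal sketch, assuming tab-delimited input (the assignment does not state the delimiter of either file) and zero-based column numbers (consistent with the name-based example, where column 0 is the first column). The helper names find_field and field_to_long are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Locate field number `col` (zero-based) in a tab-delimited line.
 * On success, stores the field length in *len and returns a pointer to
 * the first character of the field (the field is NOT NUL-terminated).
 * Returns NULL if the line has too few fields. */
static const char *find_field(const char *line, int col, size_t *len)
{
    assert(line != NULL && len != NULL);
    const char *p = line;
    for (int i = 0; i < col; i++) {
        p = strchr(p, '\t');
        if (p == NULL)
            return NULL;
        p++; /* step past the tab */
    }
    *len = strcspn(p, "\t\r\n"); /* field ends at tab, newline, or NUL */
    return p;
}

/* Parse field `col` as a base-10 integer.
 * Returns 0 on success, -1 if the field is missing, empty, or not
 * entirely numeric. */
static int field_to_long(const char *line, int col, long *out)
{
    size_t len;
    const char *f = find_field(line, col, &len);
    if (f == NULL || len == 0)
        return -1;
    char *endp;
    long v = strtol(f, &endp, 10);
    if (endp != f + len) /* non-numeric text inside the field */
        return -1;
    *out = v;
    return 0;
}
```

Calling field_to_long once for each of the two selected columns yields the start and end of a region, while the original line is kept untouched for output.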

2) Name-based diff. One column is selected from each of files A and B, and the two columns are compared as strings. Users need to specify the column number to compare in each file. For example, two example track files (A_ucsc_genes.txt and B_ucsc_gene.gtf) were downloaded from the WashU Genome Browser website (http://genomebrowser.wustl.edu). Both files contain some UCSC genes, but in different file formats. If you run

Biodiff -n -a 0 -b 8 A_ucsc_genes.txt B_ucsc_gene.gtf

the first column of A_ucsc_genes.txt and the 9th column of B_ucsc_gene.gtf are selected and compared. Lines are classified by whether their names "overlap", and the program should generate four result files, A&B_A, A&B_B, A-B, and B-A, where A&B_A contains the lines of file A that overlap some entry in file B; A&B_B contains the lines of file B that overlap some entry in file A; A-B contains the lines of file A that overlap no entry in B; and B-A contains the lines of file B that overlap no entry in A. Here, a string s is said to "overlap" another string t if s contains the whole string t or t contains the whole string s.
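The "overlap" definition for names maps directly onto the standard strstr function. A minimal sketch, assuming case-sensitive comparison of NUL-terminated strings; the function name names_overlap is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two names "overlap" if either one contains the other as a whole
 * substring. Assumption: comparison is case-sensitive and both strings
 * are NUL-terminated copies of the selected columns. */
static bool names_overlap(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    return strstr(s, t) != NULL || strstr(t, s) != NULL;
}
```

Note that an empty string is contained in every string, so under this definition an empty name overlaps everything; a real program may want to treat empty fields separately.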

Please write your program in C and test it thoroughly. Your program is expected to handle very large files (the test files may be hundreds of MB). Both accuracy and speed will be evaluated. (Hint: when you compare two files, first sort the entries in each file on the column of your choice, then compare them.)
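One possible way to follow the sorting hint for the coordinate-based style is sketched below. It is a variation rather than a prescribed method: it sorts only the reference file by start coordinate with qsort, builds a running maximum of end coordinates, and uses a binary search per query region, which avoids comparing every pair of lines. The names Entry, by_start, and mark_overlaps are hypothetical, regions are assumed to be closed intervals with start <= end, and only the two selected columns are compared (no chromosome column), as in the task description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    long start;  /* inclusive, assumed start <= end */
    long end;    /* inclusive */
    size_t id;   /* index of the source line, used to find the whole line later */
} Entry;

/* qsort comparator: ascending by start coordinate. */
static int by_start(const void *x, const void *y)
{
    const Entry *a = x, *b = y;
    return (a->start > b->start) - (a->start < b->start);
}

/* For every entry of q, set hit[q[i].id] = true if it overlaps at least
 * one entry of ref. ref is sorted in place; hit must be initialised to
 * false by the caller. Returns 0 on success, -1 if allocation fails. */
static int mark_overlaps(const Entry *q, size_t nq,
                         Entry *ref, size_t nref, bool *hit)
{
    assert(nq == 0 || (q != NULL && hit != NULL));
    if (nref == 0)
        return 0;
    qsort(ref, nref, sizeof *ref, by_start);

    /* max_end[j] = largest end among ref[0..j] */
    long *max_end = malloc(nref * sizeof *max_end);
    if (max_end == NULL)
        return -1;
    max_end[0] = ref[0].end;
    for (size_t j = 1; j < nref; j++)
        max_end[j] = ref[j].end > max_end[j - 1] ? ref[j].end : max_end[j - 1];

    for (size_t i = 0; i < nq; i++) {
        /* Binary search: lo = number of ref entries with start <= q[i].end */
        size_t lo = 0, hi = nref;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (ref[mid].start <= q[i].end)
                lo = mid + 1;
            else
                hi = mid;
        }
        /* Overlap exists iff one of those entries ends at or after q[i].start */
        if (lo > 0 && max_end[lo - 1] >= q[i].start)
            hit[q[i].id] = true;
    }
    free(max_end);
    return 0;
}
```

Calling mark_overlaps once with A as the query and B as the reference, and once the other way round, gives the two hit arrays; lines whose flag is set go to A&B_A or A&B_B, and the rest go to A-B or B-A. The cost is dominated by the sort and the binary searches, roughly O((n + m) log m) per direction.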

In addition, please write your code as cleanly as you can and comment it as thoroughly as you can.

Your report should include at least four parts: 1) design of the program; 2) implementation of the program; 3) usage of your program and test examples together with results; 4) conclusions and discussion. Part 2 should include the source code as an appendix.

Turning in your project report

Please hand in an electronic copy of your homework report, which includes the source code, how you compiled it, how you tested your program, and the results of the test run of your program. You are strongly advised to test your code on a local machine or on the teaching server before you submit your homework.

