A DNA sequence or genetic sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information. The possible letters are A, C, G, and T, representing the four nucleotide subunits of a DNA strand: adenine, cytosine, guanine and thymine bases covalently linked to phospho-backbone.
DNA sequences undergo mutations during the evolution of species, which means that some letters are randomly replaced with others. Therefore, the DNA sequences of two closely related species are very similar, and the difference increases as the distance between the species increases. The mutations do not occur with uniform frequency throughout the sequence; typically there are fewer mutations at the biologically important parts, since even a single mutation can be lethal at such a place. On the other hand, if a part of the sequence does not carry any biologically relevant information, then mutations on this part have no effect. It follows that if we compare the DNA sequences of two species and a particular region of the sequence contains fewer than the average number of mutations, then most probably this part of the sequence plays an important biological role. Therefore, it is of crucial importance to identify such regions. More precisely, aconserved region is a consecutive interval of the DNA sequence such that in this region at most p percent of the letters are different in the two sequences. Your task is to write a program that, given two DNA sequences, finds the longest conserved region.
Input
The input contains several blocks of test cases. Each case begins with a line containing two integers: 1n150000 , the length of the genetic sequences and 1p99 , the maximum percentage of mutated letters allowed in a conserved region. This is followed by two lines, each containing a DNA sequence of length n . The sequence contains only the letters `A‘, `C‘, `G‘, and `T‘.
The input is terminated by a test case with n = 0 .
Output
For each test case, you have to output a line containing a single integer: the length of the longest conserved region between the two sequences. If there are no conserved regions in the input, then output `No solution.‘ (without quotes).
Sample Input
14 25 ACCGGTAACGTGAA ACTGGATACGTAAA 14 24 ACCGGTAACGTGAA ACTGGATACGTAAA 8 1 AAAAAAAA CCCCCCCC 8 33 AAACAAAA CCCCCCCC 0 0
Sample Output
8 7 No solution. 1题意:给定两个DNF序列,求最长序列长度,使得当中突变的DNF个数占比例不超过P。
思路:先求出1 - i的和 sum[i].然后每个区间的比例为 (sum[j] - sum[i]) / (j - i) >= p / 100转化为 sum[j] * 100 - p * j >= sum[i] * 100 - p * i。
如此一来发现只要sum[j] * 100 - p * j 比较大就能满足条件,那么对于每个位置j。如果后一个比前一个大,那么只需要用前一个就可以了,后一个就可以忽略不计了。因为后一个能满足的前一个肯定能满足且长度更长,如此一来就去构造一个递减序列,然后对于每个值,在这个序列中找到一个比他小且越靠前越好的即可,这步用到二分。
代码:
#include <stdio.h> #include <string.h> #define max(a,b) ((a)>(b)?(a):(b)) const int N = 150005; int n, P, ans, qn; char a[N], b[N]; struct Point { int sum, id; Point() { sum = 0; id = 0; } int cal() { return id * P - sum * 100; } } p[N], q[N]; int find(Point p) { int l = 0, r = qn - 1; while (l < r) { int mid = (l + r) / 2; if (p.cal() >= q[mid].cal()) r = mid; else l = mid + 1; } return q[l].id; } bool solve() { qn = 1; ans = 0; q[0] = p[0]; for (int i = 1; i <= n; i++) { p[i].sum = p[i - 1].sum + (a[i] != b[i]); p[i].id = i; if (p[i].cal() < q[qn - 1].cal()) q[qn++] = p[i]; else { int t = find(p[i]); ans = max(ans, p[i].id - t); } } if (ans) return true; return false; } int main() { while (~scanf("%d%d", &n, &P) && n) { scanf("%s%s", a + 1, b + 1); if (solve()) printf("%d\n", ans); else printf("No solution.\n"); } return 0; }