Leetcode Repeated-DNA-Sequences
2022-09-06 19:58
Data structures:
#DS #hash_table #string
Difficulty:
#coding_problem #difficulty_medium
Additional tags:
#leetcode
Revisions:
N/A
Links:
Problem
The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
- For example, "ACGAATTCCG" is a DNA sequence.
When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Examples
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Constraints
Thoughts
[!summary] This is a #hash_table problem.
The question asks for every repeated substring, and the substrings can overlap. A map is preferred here, for two reasons:
- It gives an easy way to know whether a substring is a duplicate (a set or a map would both suffice for this).
- It keeps track of how many times each substring has been seen, so we append it to the answer only the first time we find it repeated.
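The counting idea in the second point can be sketched on its own, independent of DNA strings. This is a minimal illustration, not the final solution; the helper name `firstRepeats` is hypothetical. It reports each item exactly once, at the moment it is seen for the second time.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns every item that occurs more than once,
// each reported a single time, in order of its second occurrence.
std::vector<std::string> firstRepeats(const std::vector<std::string>& items) {
    std::unordered_map<std::string, int> seen;  // item -> times seen so far
    std::vector<std::string> repeats;
    for (const std::string& item : items) {
        // Post-increment yields the old count: 1 means this is the second sighting.
        if (seen[item]++ == 1) {
            repeats.push_back(item);
        }
    }
    return repeats;
}
```

With a plain set we would know that an item is a duplicate, but we would need a second set to avoid appending it again on the third and later sightings.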
One pitfall: in the for loop, the upper bound should be:
for (int i = 0, top = s.size() - 9; i < top; i++)
                               ^^^
Minus 9, because a 10-letter substring starting at i ends at i + 9, so the last valid start index is s.size() - 10.
1234567890
^        ^
|--------|
i        i+9
i + 9 - i + 1 = 10.
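The same bound can be checked with a small helper that counts the valid start indices. This is only a sketch; the name `windowCount` is hypothetical. It also guards the case where the string is shorter than the window, since subtracting on an unsigned size would wrap around.

```cpp
#include <cstddef>

// Hypothetical helper: number of valid start indices for windows of
// length k inside a string of length n.
std::size_t windowCount(std::size_t n, std::size_t k) {
    // Check n >= k first: n - k on unsigned values wraps around when n < k.
    return n >= k ? n - k + 1 : 0;
}
```

For k = 10 this gives n - 9 start indices, which is exactly the `top` value used as the exclusive upper bound of the loop.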
With these edge-cases taken care of, we can proceed to the solution:
Solution
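#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        // Number of times each 10-letter window has been seen so far.
        unordered_map<string, int> used;
        vector<string> ans = {};
        // Signed arithmetic: inputs shorter than 10 give a non-positive bound, so the loop is skipped.
        for (int i = 0, size = static_cast<int>(s.size()) - 9; i < size; i++) {
            string tmp = s.substr(i, 10);
            // Post-increment returns the old count; 1 means this is the second sighting.
            if ((used[tmp]++) == 1) {
                ans.push_back(tmp);
            }
        }
        return ans;
    }
};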
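As a quick sanity check, the two examples from the problem statement can be run against the same logic. The sketch below restates the approach as a free function, `repeatedDna` (a hypothetical name), so that the block compiles on its own; it sorts the result because the problem accepts any order.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical free-function restatement of the solution above.
std::vector<std::string> repeatedDna(const std::string& s) {
    std::unordered_map<std::string, int> seen;  // window -> times seen so far
    std::vector<std::string> ans;
    const int top = static_cast<int>(s.size()) - 9;  // non-positive for short inputs
    for (int i = 0; i < top; i++) {
        std::string window = s.substr(i, 10);
        if (seen[window]++ == 1) {
            ans.push_back(window);  // second sighting: report once
        }
    }
    // Any order is accepted; sorting makes comparisons deterministic.
    std::sort(ans.begin(), ans.end());
    return ans;
}
```

Each window costs a 10-character copy and hash, so the running time is linear in the length of s for this fixed window size.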