2023-06-14 14:27:22 +08:00

Leetcode Repeated-DNA-Sequences

2022-09-06 19:58

Data structures:

#DS #hash_table #string

Difficulty:

#coding_problems #difficulty_medium

Additional tags:

#leetcode

Revisions:

N/A

Problem

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

  • For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Examples

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints

Thoughts

[!summary] This is a #hash_table problem.

The question asks for every 10-letter substring that repeats, and the substrings can overlap. So using a map is preferred. (Why?)

Two reasons:

  • It is an easy way to tell whether a substring is a duplicate (a set or a map suffices).

  • It keeps a count of how many times each substring has been seen, so we append it to the answer only once, the first time it repeats.

    One pitfall: in the for loop, the upper bound should be:

    for (int i = 0, top = s.size() - 9; i < top; i++)
                                  ^^^
    

    We subtract 9 because a 10-letter substring starting at i ends at index i + 9, and i + 9 must stay below s.size().

    1234567890
    ^        ^
    |--------|
    i       i+9
    
    (i + 9) - i + 1 = 10.
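
    The bound can be checked with a small sketch. The helper name `countWindows` is hypothetical and not part of the solution; it only counts how many start indices the loop visits. It assumes we cast `s.size()` to a signed `int` before subtracting, because `s.size() - 9` is unsigned arithmetic and wraps around when the string is shorter than 9 characters.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: number of 10-letter windows in s.
// Valid start indices are 0 .. n - 10, i.e. i < n - 9.
int countWindows(const std::string& s) {
    // Signed on purpose: the bound is zero or negative for short strings,
    // so the loop body is simply skipped.
    const int top = static_cast<int>(s.size()) - 9;
    int count = 0;
    for (int i = 0; i < top; i++) {
        // The window covers s[i] .. s[i + 9]: (i + 9) - i + 1 = 10 letters.
        assert(s.substr(i, 10).size() == 10);
        count++;
    }
    return count;
}
```

    A string of exactly 10 letters yields one window, and anything shorter yields none.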
    

    With these edge-cases taken care of, we can proceed to the solution:

Solution

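#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
  vector<string> findRepeatedDnaSequences(string s) {
    unordered_map<string, int> used; // window -> times seen so far
    vector<string> ans;
    // Cast before subtracting: s.size() is unsigned, so s.size() - 9
    // wraps around for strings shorter than 9 characters.
    for (int i = 0, top = static_cast<int>(s.size()) - 9; i < top; i++) {
      string tmp = s.substr(i, 10);

      // The old count is 1 exactly on the second sighting, so each
      // repeated window is appended only once.
      if ((used[tmp]++) == 1) {
        ans.push_back(tmp);
      }
    }

    return ans;
  }
};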
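
For comparison, here is a rough sketch of the set-only variant hinted at in the Thoughts section. The function name `findRepeatedWithSets` is hypothetical. Because a set cannot count, this version assumes a second set is used to keep each repeated window from entering the answer more than once, which is the job the counter does in the map version.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Sketch of a set-only variant; the function name is hypothetical.
std::vector<std::string> findRepeatedWithSets(const std::string& s) {
    std::unordered_set<std::string> seen;     // windows met at least once
    std::unordered_set<std::string> repeated; // windows met at least twice
    const int top = static_cast<int>(s.size()) - 9;
    for (int i = 0; i < top; i++) {
        std::string window = s.substr(i, 10);
        // insert().second is false when the window was already in `seen`.
        if (!seen.insert(window).second) {
            repeated.insert(window);
        }
    }
    // Order is unspecified, which the problem statement allows.
    return std::vector<std::string>(repeated.begin(), repeated.end());
}
```

Both versions visit each window once; the map version needs one container instead of two and keeps the answer in first-repeat order.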