Leetcode Repeated-DNA-Sequences

2022-09-06 19:58

Data structures:

#DS #hash_table #string

Difficulty:

#coding_problem #difficulty_medium

Additional tags:

#leetcode

Revisions:

N/A


Problem

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

  • For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Examples

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints

Thoughts

[!summary] This is a #hash_table problem.

The question asks for every 10-letter substring that occurs more than once, and the occurrences can overlap. A map is preferred here, for two reasons:

  • It gives an easy way to tell whether a substring has been seen before (a set or a map would both suffice for this).
  • It keeps a count of how many times each substring has been seen, so we append it to the answer only once, the first time it repeats.
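The second reason can be sketched in isolation. This is a minimal illustration, not part of the solution: `firstRepeats` is a hypothetical helper name, and it works on a plain list of strings rather than on DNA windows. It shows that comparing the post-incremented count against 1 fires exactly once per repeated item, no matter how many more times that item appears.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the note): returns each repeated item
// exactly once, at the moment its second occurrence is seen.
std::vector<std::string> firstRepeats(const std::vector<std::string>& items) {
  std::unordered_map<std::string, int> seen;  // item -> times seen so far
  std::vector<std::string> out;
  for (const std::string& item : items) {
    // Post-increment yields the count *before* this sighting, so the
    // comparison is true only on the second occurrence.
    if (seen[item]++ == 1) {
      out.push_back(item);
    }
  }
  return out;
}
```

With a set alone, a third or fourth occurrence would look the same as the second, so a second container would be needed to avoid appending the same substring again.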

One pitfall: in the for loop, the upper bound should be:

for (int i = 0, top = s.size() - 9; i < top; i++)
                                ^^^

Minus 9, because a 10-letter substring starting at i extends 9 positions past i, so its last character is at i + 9.

1234567890
^        ^
|--------|
i       i+9

(i + 9) - i + 1 = 10.
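The same arithmetic can be written as a small function. This is a sketch under assumptions: `countWindows` and its parameter `k` are hypothetical names introduced for illustration. It also makes the short-input case explicit. Because `s.size()` is unsigned, `s.size() - 9` wraps around when the string is shorter than 9 characters. The loop above stores the bound in an `int`, which typically ends up negative so the loop is skipped, but an explicit guard avoids relying on that conversion.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the note): number of length-k windows
// in s, where k = 10 for this problem.
std::size_t countWindows(const std::string& s, std::size_t k = 10) {
  // Guard first: a string shorter than k has no window at all, and the
  // unsigned subtraction below would otherwise wrap around.
  if (s.size() < k) return 0;
  // Valid start indices are 0 .. s.size() - k, which is
  // s.size() - (k - 1) positions in total.
  return s.size() - (k - 1);
}
```

For the 10-character diagram above this gives exactly one window (i = 0), and for Example 2 (13 characters) it gives four.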

With this edge case taken care of, we can proceed to the solution:

Solution

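#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
  vector<string> findRepeatedDnaSequences(string s) {
    unordered_map<string, int> used; // substring -> times seen so far
    vector<string> ans;
    // Cast before subtracting so inputs shorter than 10 give a bound
    // <= 0 (no iterations) instead of an unsigned wrap-around.
    for (int i = 0, top = static_cast<int>(s.size()) - 9; i < top; i++) {
      string tmp = s.substr(i, 10);

      // Post-increment returns the old count: true only on the second sighting.
      if ((used[tmp]++) == 1) {
        ans.push_back(tmp);
      }
    }

    return ans;
  }
};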
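To check the approach against the two examples from the problem statement outside of the LeetCode judge, the same logic can be wrapped in a free function. This is a sketch: `findRepeated` is a hypothetical name, and the body simply mirrors the solution above.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical free-function version of the solution, for local testing.
std::vector<std::string> findRepeated(const std::string& s) {
  std::unordered_map<std::string, int> used;  // substring -> times seen so far
  std::vector<std::string> ans;
  // Signed bound: zero or negative when s has fewer than 10 characters.
  const int top = static_cast<int>(s.size()) - 9;
  for (int i = 0; i < top; i++) {
    std::string tmp = s.substr(i, 10);
    // Record the substring only when its second occurrence is seen.
    if (used[tmp]++ == 1) {
      ans.push_back(tmp);
    }
  }
  return ans;
}
```

Calling `findRepeated` on the input of Example 1 yields "AAAAACCCCC" followed by "CCCCCAAAAA", and on Example 2 it yields the single entry "AAAAAAAAAA". The problem accepts any order; this version happens to report substrings in the order their second occurrence appears.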