Introduction
C++ Snippets
A string tokenizer
字符串分隔
1 | #include <string> |
C++ Snippets
字符串分隔
1 | #include <string> |
Given two arrays, write a function to compute their intersection.
Example:
Given nums1 = [1, 2, 2, 1]
, nums2 = [2, 2]
, return [2]
.
Note:
Each element in the result must be unique.
The result can be in any order.
Related Topics: Hash Table, Two Pointers, Binary Search, Sort
原问题: 349. Intersection of Two Arrays
中文翻译版: 349. 两个数组的交集
可以创建两个哈希表,首先遍历数组 nums1,将数组元素添加到第一个哈希表中,然后遍历数组 nums2,如果数组元素在第一个哈希表中不存在,则将该数组元素添加到第二个哈希表中,遍历完数组 nums2 后,将第二个哈希表的中元素作为结果输出。实现代码如下:
1 | #include <iostream> |
将数组 nums1 和数组 nums2 进行从小到大排序,设置两个指针 p1 和 p2 分别指向数组 nums1 和数组 nums2 的第一个元素。开始遍历数组时,如果指针 p1 和指针 p2 指向的元素不相等,则移动指向元素中较小值的指针,如果指针 p1 和指针 p2 指向的元素相等,则同时移动指针 p1 和 p2,并记录该相等的元素 (找到相等元素时先判断该元素是否和这个记录值相等,如果相等则说明交集已经存在该元素,不需要添加到交集中,否则添加到交集中),依次操作,直到其中一个集合没有元素可比较为止。实现代码如下:
1 | #include <algorithm> |
Give a string s
, count the number of non-empty (contiguous) substrings that have the same number of 0’s and 1’s, and all the 0’s and all the 1’s in these substrings are grouped consecutively.
Substrings that occur multiple times are counted the number of times they occur.
Example 1:
1 | Input: "00110011" |
Example 2:
1 | Input: "10101" |
Note:
s.length will be between 1 and 50,000.
s will only consist of “0” or “1” characters.
Related Topics: String
原问题: 696. Count Binary Substrings
中文翻译版: 696. 计数二进制子串
如果使用暴力方法进行解决,很容易超时,因为字符串 s
最长为 50000,因此需要考虑一些取巧的方法。这里可以创建一个数组,数组元素存储每个连续子串的长度,例如,将字符串 1111000011010001011
转化为数组 [4, 4, 2, 1, 1, 3, 1, 1, 2]
,根据这个数组,我们可以很方便地获得满足条件的子串个数,只需要将数组中相邻元素之间最小的那个求和即可,即
1 | n = min(4, 4) + min(4, 2) + min(2, 1) + min(1, 1) |
实现代码如下:
1 | #include <iostream> |
Given a string, find the first non-repeating character in it and return it’s index. If it doesn’t exist, return -1.
Examples:
1 | s = "leetcode" |
Note: You may assume the string contain only lowercase letters.
Related Topics: Hash Table, String
原问题: 387. First Unique Character in a String
中文翻译版: 387. 字符串中的第一个唯一字符
本题需要找到第一个不重复出现的字符的位置,因此在遍历字符串时需要保存字符的位置。由于字符串中只有小写字母,因此可以创建一个大小为 26 的数组,存储 26 个字母在字符串中出现位置,由于还需要判断该字母是否重复出现,这时可以用一个特殊值表示该字母在字符串中重复出现。数组元素的值分三种情况:
根据输入的字符串构造出这个数组后,我们可以很快找到第一个不重复出现的字符的位置 (即数组元素中值最小的那个,不包括 -1 和 0)。实现代码如下:
1 | #include <iostream> |
对于K均值算法中聚类个数K的选择,通常有四种方法:
本文相关算法的代码以及实验在:GitHub
简单地说就是按照建模的需求和目的来选择聚类的个数。比如说,一个游戏公司想把所有玩家做聚类分析,分成顶级、高级、中级、菜鸟四类,那么K=4;如果房地产公司想把当地的商品房分成高中低三档,那么K=3。按需选择法虽然合理,但是未必能保证在做K-Means时能够得到清晰的分界线。
观察法就是将数据可视化,然后用肉眼去观察这些数据点大概分成几堆。这个方法有个缺点,那就是数据维度不能太高,一般是二维或者三维,否则数据无法可视化出来。
手肘法 (Elbow Method) 本质上也是一种间接的观察法。当我们对数据 ${ x_1, \ldots, x_N }$ 进行K均值聚类后,我们将得到 $K$ 个聚类的中心点 $\mu_k, k=1,2, \ldots, K$,以及数据点 $x_i$ 所对应的簇 $C_k, k=1, 2, \ldots, K$,每个簇中有 $n_k$ 个数据点。我们定义每个簇中的所有点相互之间的距离的和为
$$
D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2
$$
其中 $\Vert \cdot \Vert$ 为2范数。则对于聚类个数为 $K$ 时,我们可以定义一个测度量为
$$
W_K = \sum_{k=1}^K \frac{1}{2n_k} D_k
$$
对于不同的 $K$,经过 K-means 算法后我们会得到不同的中心点和数据点所属的簇,从而得到不同的度量 $W_K$。将聚类个数 $K$ 作为横坐标,$W_K$ 作为纵坐标,我们可以得到类似下面的图像:
图像中图形很像人的手肘,该方法的命名就是从这而来。从图像中我们可以观察到,$K = 1$ 到 $4$ 下降很快,$K = 4$ 之后趋于平稳,因此 $K = 4$ 是一个拐点,手肘法认为这个拐点就是最佳的聚类个数 $K$。
手肘法是一个经验方法,观察结果因人而异,特别是遇到拐点位置不是很明显时。相比于前面的观察法,该方法的优点在于其适用于高维度的数据。
如果每个簇的中心点为该簇中的所有数据点的平均值时,即
$$
\mu_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i
$$
则 $D_k$ 可以写为
$$
D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 = 2n_k \sum_{x_i \in C_k} \Vert x_i - \mu_k \Vert^2
$$
$W_K$ 写为
$$
W_K = \sum_{k=1}^K \frac{1}{2n_k} D_k = \sum_{k=1}^K \sum_{x_i \in C_k} \Vert x_i - \mu_k \Vert^2
$$
Note: 以下为公式推导内容,可略过。将 $D_k$ 表达式分解可得:
$$
\begin{align}
& \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 \
= & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert (x_i - \mu_k) - (x_j - \mu_k) \Vert^2 \
= & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left( \Vert x_i - \mu_k \Vert^2 + \Vert x_j - \mu_k \Vert^2 - 2 \left< x_i - \mu_k, x_j - \mu_k \right> \right) \
= & \sum_{x_i \in C_k} n_k \Vert x_i - \mu_k \Vert^2 + \sum_{x_j \in C_k} n_k \Vert x_j - \mu_k \Vert^2 - 2 \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> \
= & 2 \sum_{x_i \in C_k} n_k \Vert x_i - \mu_k \Vert^2 - 2 \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right>
\end{align}
$$
其中
$$
\begin{align}
& \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> \
= & \sum_{x_i \in C_k} \left< x_i - \mu_k, \sum_{x_j \in C_k} (x_j - \mu_k) \right>
\end{align}
$$
由于
$$
\sum_{x_j \in C_k} (x_j - \mu_k) = \sum_{x_j \in C_k} x_j - n_k \mu_k = ; n_k \mu_k - n_k \mu_k = 0
$$
则
$$
\sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> = 0
$$
因此可以得到
$$
\sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 = 2 n_k \sum_{x_i \in C_k} \Vert x_i - \mu_k \Vert^2
$$
这个方法来源于论文[3],详细内容可以去阅读该论文。延用前面关于度量 $W_K$ 的计算方式,该算法的流程如下:
cluster the observed data, varying the total number of clusters from $K=1, 2, \ldots, K_{\text{max}}$, and giving within-dispersion measures $W_K, K=1, 2, \ldots, K_{\text{max}}$
generate $B$ reference data sets and cluster each one giving within-dispersion measures $W_{Kb}$, $b=1,2, \ldots, B$, $K=1, 2, \ldots, K_{\text{max}}$. Compute the gap statistic
$$
\text{Gap}(K) =\frac{1}{B} \sum_{b=1}^B \log(W_{Kb}) - \log(W_K)
$$
let $\bar{w} = \frac{1}{B} \sum_{b=1}^B \log(W_{Kb})$, compute the standard deviation
$$
\text{sd}K = \left[ \frac{1}{B} \sum{b=1}^B (\log(W_{Kb}) - \bar{w})^2 \right]^{\frac{1}{2}}
$$
and define $s_K = \text{sd}_K \sqrt{1 + \frac{1}{B}}$
choose the number of clusters as the smallest $K$ such that $\text{Gap(K)} \ge \text{Gap(K+1)} - s_{K+1}$
关于 reference data sets 的生成,论文给了两种生成方法,这里只用其中一个,即对于生成的样本 $x_r$,其维度为 $d$,该样本的元素 $x_{ri}$ 的值按照均匀分布 $U(a, b)$ 随机生成,其中 $a$ 为待聚类的所有数据点的第 $i$ 位置元素的最小值, $b$ 为数据点的第 $i$ 位置元素的最大值。经过实验发现,这个算法的 3,4 步骤可以去掉,只用第 2 步计算出的 $\text{Gap}(K)$,然后选择最大 $\text{Gap}(K)$ 所对应的 $K$ 作为最优的聚类个数。这里贴出从一个数据集中得到的 $\text{Gap}(K)$ 的变化曲线图
可以发现,当 $K = 4$ 时,$\text{Gap}(K)$ 取得最大值,因此选择 $K = 4$ 作为最优的聚类个数。该算法实现的代码为:
1 | import numpy as np |
Given a paragraph and a list of banned words, return the most frequent word that is not in the list of banned words. It is guaranteed there is at least one word that isn’t banned, and that the answer is unique.
Words in the list of banned words are given in lowercase, and free of punctuation. Words in the paragraph are not case sensitive. The answer is in lowercase.
Example:
1 | Input: |
Note:
1 <= paragraph.length <= 1000
.1 <= banned.length <= 100
.1 <= banned[i].length <= 10
.paragraph
may have uppercase symbols, and even if it is a proper noun.)paragraph
only consists of letters, spaces, or the punctuation symbols !?',;.
paragraph
are always separated by a space.Related Topics: String
中文翻译版: 819. Most Common Word
题目要求在字符串 paragraph
中找到出现次数最多的单词,并且单词不能是列表 banned
中的单词。解决该题可以设定两个哈希表,第一个哈希表的 key 为列表 banned
中的单词,用于判断单词是否在列表中,第二个哈希表的 key 为字符串 paragraph
中的单词,value 为该单词的出现次数。依次取出字符串 paragraph
中的单词,由于题目要求单词都是小写,所以要将单词转换成小写,然后在第一个哈希表中查找该单词是否存在,如果不存在,将第二个哈希表中该单词的出现次数加1,否则取字符串下一个单词,依次重复这个操作直到遍历完字符串 paragraph
的所有单词,接着根据第二个哈希表找出出现次数最多的单词。
1 | #include <iostream> |
Initially, there is a Robot at position (0, 0). Given a sequence of its moves, judge if this robot makes a circle, which means it moves back to the original place.
The move sequence is represented by a string. And each move is represent by a character. The valid robot moves are R
(Right), L
(Left), U
(Up) and D
(down). The output should be true or false representing whether the robot makes a circle.
Example 1:
1 | Input: "UD" |
Example 2:
1 | Input: "LL" |
Related Topics: String
中文翻译版: 657. 判断路线成圈
根据题目描述,可以设定一个二维坐标 (x, y),根据指令做相应的改变:
最后判断二维坐标 (x, y) 是否为原点 (0, 0) 即可。实现代码如下:
1 | #include <iostream> |
Roman numerals are represented by seven different symbols: I, V, X, L, C, D and M.
1 | Symbol Value |
For example, two is written as II
in Roman numeral, just two one’s added together. Twelve is written as, XII
, which is simply X
+ II
. The number twenty seven is written as XXVII
, which is XX
+ V
+ II
.
Roman numerals are usually written largest to smallest from left to right. However, the numeral for four is not IIII
. Instead, the number four is written as IV
. Because the one is before the five we subtract it making four. The same principle applies to the number nine, which is written as IX
. There are six instances where subtraction is used:
I
can be placed before V
(5) and X
(10) to make 4 and 9.X
can be placed before L
(50) and C
(100) to make 40 and 90.C
can be placed before D
(500) and M
(1000) to make 400 and 900.Given a roman numeral, convert it to an integer. Input is guaranteed to be within the range from 1 to 3999.
Example 1:
1 | Input: "III" |
Example 2:
1 | Input: "IV" |
Example 3:
1 | Input: "IX" |
Example 4:
1 | Input: "LVIII" |
Example 5:
1 | Input: "MCMXCIV" |
Related Topics: Math, String
原问题: 13. Roman to Integer
中文翻译版: 13. 罗马数字转整数
题目难度不大,根据题目描述可以写出代码
1 | #include <iostream> |
Given an arbitrary ransom note string and another string containing letters from all the magazines, write a function that will return true if the ransom note can be constructed from the magazines ; otherwise, it will return false.
Each letter in the magazine string can only be used once in your ransom note.
Note:
You may assume that both strings contain only lowercase letters.
1 | canConstruct("a", "b") -> false |
Related Topics: String
原问题:383. Ransom Note
中文翻译版:383. 赎金信
这道题要求字符串 ransom note 中的字符都能从字符串 magazine 中找到,而且 ransom note 中的字符的重复次数要小于等于 magazine 相对应字符的重复次数。基于上面的要求,有个很简单的解决方案,对字符串 ransom note 和字符串 magazine 分别构造一个大小为 26 的数组,存储 26 个字母在字符串中出现的次数,然后遍历数组,判断 ransom note 中的字符的重复次数是否小于等于 magazine 相对应字符的重复次数。
1 | #include <iostream> |
Given a group of two strings, you need to find the longest uncommon subsequence of this group of two strings. The longest uncommon subsequence is defined as the longest subsequence of one of these strings and this subsequence should not be any subsequence of the other strings.
A subsequence is a sequence that can be derived from one sequence by deleting some characters without changing the order of the remaining elements. Trivially, any string is a subsequence of itself and an empty string is a subsequence of any string.
The input will be two strings, and the output needs to be the length of the longest uncommon subsequence. If the longest uncommon subsequence doesn’t exist, return -1.
Example 1:
1 | Input: "aba", "cdc" |
Note:
Related Topics: String
原问题:521. Longest Uncommon Subsequence I
中文翻译版:521. 最长特殊序列 Ⅰ
这道题粗看以为是一道很难的题目,但是解决该问题只需要考虑三种情况:
贴上实现代码
1 | #include <iostream> |