袋熊的树洞

日拱一卒,功不唐捐

0%

Introduction

Shell Snippets

列出目录里面文件名并进行排序

在目录 train 里面有大量的图片 (扩展名为 png)

1
2
3
4
5
6
7
8
9
10
11
$ ls -l train/ | head
total 400000
-rw-r--r--@ 1 luowanqian staff 2455 10 19 2013 1.png
-rw-r--r--@ 1 luowanqian staff 2101 10 19 2013 10.png
-rw-r--r--@ 1 luowanqian staff 2466 10 19 2013 100.png
-rw-r--r--@ 1 luowanqian staff 2171 10 19 2013 1000.png
-rw-r--r--@ 1 luowanqian staff 2031 10 19 2013 10000.png
-rw-r--r--@ 1 luowanqian staff 2051 10 19 2013 10001.png
-rw-r--r--@ 1 luowanqian staff 2317 10 19 2013 10002.png
-rw-r--r--@ 1 luowanqian staff 2511 10 19 2013 10003.png
-rw-r--r--@ 1 luowanqian staff 2460 10 19 2013 10004.png

提取该目录下的所有图片的文件名并进行排序

方案1

1
2
3
4
5
6
7
8
9
10
11
12
$ ls -l train/ | grep ".png" | tr -s ' ' | cut -d ' ' -f 9 | sort -n > list.txt
$ head list.txt
1.png
2.png
3.png
4.png
5.png
6.png
7.png
8.png
9.png
10.png

其中

  • tr -s ' ' 是为了将多个space压缩成一个space
  • cut -d ' ' -f 9 是为了取出最后一列的文件名
  • sort -n 是将文件进行排序,由于文件名里面有数字,所以用-n选项

方案2

1
2
3
4
5
6
7
8
9
10
11
12
$ ls train/ | sort -n > list.txt
$ head list.txt
1.png
2.png
3.png
4.png
5.png
6.png
7.png
8.png
9.png
10.png

pip升级所有包

pip 升级所有Python包,命令如下

1
pip3 list --outdated --format=freeze | cut -d = -f 1 | xargs pip3 install -U'

其中 pip3 可以根据 pip 版本替换。

rsync断点续传

主要使用的是 rsync 的 -P 的选项,传输命令写为

1
$ rsync -P ubuntu/ubuntu-16.04.4-desktop-amd64.iso lab217server:/home/luowanqian/Downloads/

如果传输过程中网络中断或者使用了 Ctrl + C,此时可以再次使用该命令进行断点续传。

参考:

  1. How To Resume Partially Transferred Files Over SSH Using Rsync

介绍

关于ROC曲线的详细介绍,可以参考周志华的西瓜书 (《机器学习》),本文主要介绍如何使用Python绘制该曲线。ROC曲线的纵轴是“真正例率” (True Positive Rate,简称TPR),横轴是“假正例率” (False Positive Rate,简称FPR),两者定义为:
$$
\begin{align}
TPR & = \frac{TP}{TP + FN} \
FPR & = \frac{FP}{TN + FP}
\end{align}
$$
相关的符号定义为:

在现实任务中,我们获取有限个 (FPR, TPR) 坐标对来勾勒出ROC曲线。要获得多个坐标对需要多组二分类结果,而我们做二分类任务时,通常只能获取一组分类结果,此时我们利用这组结果生成多组分类结果,具体做法是:给定 $m$ 个正例和 $n$ 个反例,根据分类器的预测值对样例进行从大到小排序,然后把分类阈值设为样例预测值中最大的那个,将样例进行正反例分类,计算相应的TPR和FPR,然后令分类阈值依次设为每个样例的预测值,将样例进行正反例分类,接着计算TPR和FPR,重复这个操作,直到分类阈值取完所有样例的预测值。

关于分类器对样例的预测值,我们可以用分类器判定样例为正例的概率值,也可以用分类器对样例的评分值,预测值大于等于分类阈值时,分类器判定样例为正例,否则判定为负例。

例子

已知10样例的真实标签 (0: 反例,1: 正例) 以及分类器对该样例的评分值

1
2
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
y_score = [0.505, 0.6, 0.8, 0.52, 0.55, 0.53, 0.54, 0.9, 0.51, 0.7]

根据评分值将样例进行从大到小排序,可以得到

1
2
y_true = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505]

将分类阈值设为 0.9,此时样例的分类为

1
2
y_true = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
y_predict = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

可以得到 TP = 1,FP = 0,FN = 5,TN = 4,则可得到 TPR = 1/(1+5) = 1/6 以及 FPR = 0。接着将分类阈值设为 0.8,然后将样例进行正反例分类,接着计算相应的TPR和FPR,直到分类阈值取完所有评分值,最终可以得到下面的TPR和FPR列表

1
2
TPR = [0.16666667, 0.33333333, 0.33333333, 0.5, 0.66666667, 0.83333333, 0.83333333, 0.83333333, 1.0, 1.0]
FPR = [0.0, 0.0, 0.25, 0.25, 0.25, 0.25, 0.5, 0.75, 0.75, 1.0]

然后根据这一系列的 (FPR, TPR) 坐标对画出ROC曲线。

实现

相关代码以及运用可以参考:GitHub

Scikit-learn实现

Scikit-learn库提供了一个名为 roc_curve 函数来获取FPR以及TRP,函数原型如下

1
sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)

要想得到上面例子的计算结果,需要把 drop_intermediate 设为 False。

个人实现

我这里直接照搬[1]中的实现,贴出代码如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def binary_clf_curve(y_true, y_score, pos_label=None):
if pos_label is None:
pos_label = 1.0

# make y_true a boolean vector
y_true = (y_true == pos_label)

# sort scores
desc_score_indices = np.argsort(y_score)[::-1]
thresholds = y_score[desc_score_indices]

fps = []
tps = []
for threshold in thresholds:
# 大于等于阈值判定为 1 (正类),否则为 0 (负类)
y_predict = [1 if i >= threshold else 0 for i in y_score]
# 预测值是否等于真实值
result = [i == j for i, j in zip(y_true, y_predict)]
# 预测值是否为 1 (正类)
positive = [i == 1 for i in y_predict]

# 预测为正类且预测错误
fp = [(not i) and j for i, j in zip(result, positive)].count(True)
# 预测为正类且预测正确
tp = [i and j for i, j in zip(result, positive)].count(True)

fps.append(fp)
tps.append(tp)
fps = np.array(fps)
tps = np.array(tps)

return fps, tps, thresholds


def roc_curve(y_true, y_score, pos_label=None):
fps, tps, thresholds = binary_clf_curve(y_true, y_score, pos_label)

fpr = fps / fps[-1]
tpr = tps / tps[-1]

return fpr, tpr, thresholds

参考

  1. 通过一个例子来绘制一条ROC曲线?
  2. roc-auc
  3. ROC和AUC介绍以及如何计算AUC

Introduction

C++ Snippets

A string tokenizer

字符串分隔

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
#include <string>
#include <vector>
#include <iostream>

using namespace std;

void tokenize(const string &str, vector<string> &tokens, const string &delimiters = " ")
{
// Skip delimiters at beginning.
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find first "non-delimiter".
string::size_type pos = str.find_first_of(delimiters, lastPos);

while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the vector.
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters. Note the "not_of"
lastPos = str.find_first_not_of(delimiters, pos);
// Find next "non-delimiter"
pos = str.find_first_of(delimiters, lastPos);
}
}

int main()
{
vector<string> tokens;
string str("Split me up! Word1 Word2 Word3.");

tokenize(str, tokens);

cout << "String: " << str << endl;
cout << "Tokens: " << endl;
for (auto it = tokens.begin(); it != tokens.end(); it++)
cout << *it << endl;
return 0;
}

问题描述

Given two arrays, write a function to compute their intersection.

Example:

Given nums1 = [1, 2, 2, 1], nums2 = [2, 2], return [2].

Note:

  • Each element in the result must be unique.

  • The result can be in any order.

Related Topics: Hash Table, Two Pointers, Binary Search, Sort

原问题: 349. Intersection of Two Arrays

中文翻译版: 349. 两个数组的交集

解决方案

方案1

可以创建两个哈希表,首先遍历数组 nums1,将数组元素添加到第一个哈希表中,然后遍历数组 nums2,如果数组元素在第一个哈希表中不存在,则将该数组元素添加到第二个哈希表中,遍历完数组 nums2 后,将第二个哈希表的中元素作为结果输出。实现代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
#include <iostream>
#include <vector>
#include <unordered_set>
using namespace std;

class Solution {
public:
vector<int> intersection(vector<int>& nums1, vector<int>& nums2)
{
unordered_set<int> unique_nums1;
for (auto it=nums1.begin(); it!=nums1.end(); it++)
unique_nums1.insert(*it);

unordered_set<int> unique_nums;
for (auto it=nums2.begin(); it!=nums2.end(); it++) {
if (unique_nums1.find(*it) != unique_nums1.end())
unique_nums.insert(*it);
}

vector<int> ret;
for (auto it=unique_nums.begin(); it!=unique_nums.end(); it++)
ret.push_back(*it);

return ret;
}
};

int main()
{
Solution solu;
vector<int> nums1 = {1, 2, 2, 1};
vector<int> nums2 = {2, 2};
vector<int> ret = solu.intersection(nums1, nums2);

for (auto it=ret.begin(); it!=ret.end(); it++)
cout << *it << " ";
cout << endl;

return 0;
}

方案2

将数组 nums1 和数组 nums2 进行从小到大排序,设置两个指针 p1 和 p2 分别指向数组 nums1 和数组 nums2 的第一个元素。开始遍历数组时,如果指针 p1 和指针 p2 指向的元素不相等,则移动指向元素中较小值的指针,如果指针 p1 和指针 p2 指向的元素相等,则同时移动指针 p1 和 p2,并记录该相等的元素 (找到相等元素时先判断该元素是否和这个记录值相等,如果相等则说明交集已经存在该元素,不需要添加到交集中,否则添加到交集中),依次操作,直到其中一个集合没有元素可比较为止。实现代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
#include <algorithm>
#include <vector>
#include <iostream>
using namespace std;

class Solution {
public:
vector<int> intersection(vector<int>& nums1, vector<int>& nums2)
{
vector<int> ret;
if (nums1.size() == 0)
return ret;
else if (nums2.size() == 0)
return ret;

sort(nums1.begin(), nums1.end());
sort(nums2.begin(), nums2.end());

int p1, p2, repeat;
p1 = p2 = 0;
repeat = nums1[0] < nums2[0] ? nums1[0] - 1 : nums2[0] - 1;
while (p1 < nums1.size() && p2 < nums2.size()) {
if (nums1[p1] < nums2[p2]) {
p1++;
} else if (nums1[p1] > nums2[p2]) {
p2++;
} else {
if (repeat != nums1[p1])
ret.push_back(nums1[p1]);
repeat = nums1[p1];
p1++;
p2++;
}
}

return ret;
}
};

int main()
{
Solution solu;
vector<int> nums1 = {1, 2, 2, 1};
vector<int> nums2 = {2, 2};
vector<int> ret = solu.intersection(nums1, nums2);

for (auto it=ret.begin(); it!=ret.end(); it++)
cout << *it << " ";
cout << endl;

return 0;
}

问题描述

Give a string s, count the number of non-empty (contiguous) substrings that have the same number of 0’s and 1’s, and all the 0’s and all the 1’s in these substrings are grouped consecutively.

Substrings that occur multiple times are counted the number of times they occur.

Example 1:

1
2
3
4
5
6
7
Input: "00110011"
Output: 6
Explanation: There are 6 substrings that have equal number of consecutive 1's and 0's: "0011", "01", "1100", "10", "0011", and "01".

Notice that some of these substrings repeat and are counted the number of times they occur.

Also, "00110011" is not a valid substring because all the 0's (and 1's) are not grouped together.

Example 2:

1
2
3
Input: "10101"
Output: 4
Explanation: There are 4 substrings: "10", "01", "10", "01" that have equal number of consecutive 1's and 0's.

Note:

  • s.length will be between 1 and 50,000.

  • s will only consist of “0” or “1” characters.

Related Topics: String

原问题: 696. Count Binary Substrings

中文翻译版: 696. 计数二进制子串

解决方案

如果使用暴力方法进行解决,很容易超时,因为字符串 s 最长为 50000,因此需要考虑一些取巧的方法。这里可以创建一个数组,数组元素存储每个连续子串的长度,例如,将字符串 1111000011010001011 转化为数组 [4, 4, 2, 1, 1, 3, 1, 1, 2],根据这个数组,我们可以很方便地获得满足条件的子串个数,只需要将数组中相邻元素之间最小的那个求和即可,即

1
2
3
4
n = min(4, 4) + min(4, 2) + min(2, 1) + min(1, 1)
+ min(1, 3) + min(3, 1) + min(1, 1) + min(1, 2)
= 4 + 2 + 1 + 1 + 1 + 1 + 1 + 1
= 12

实现代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
#include <iostream>
#include <string>
#include <vector>
using namespace std;

class Solution {
public:
int countBinarySubstrings(string s)
{
vector<int> trans;
int num_bs = 0;
int count;

count = 0;
for (int i=0; i<s.size(); i++) {
count++;
if (i+1 < s.size() && s[i] != s[i+1]) {
trans.push_back(count);
count = 0;
continue;
}
}
if (count > 0)
trans.push_back(count);

for (int i=0; i<trans.size()-1; i++)
num_bs += trans[i] > trans[i+1] ? trans[i+1] : trans[i];

return num_bs;
}
};

int main()
{
Solution solu;

string str1 = "00110011";
cout << str1 << ": " << solu.countBinarySubstrings(str1) << endl;

string str2 = "10101";
cout << str2 << ": " << solu.countBinarySubstrings(str2) << endl;

return 0;
}

问题描述

Given a string, find the first non-repeating character in it and return it’s index. If it doesn’t exist, return -1.

Examples:

1
2
3
4
5
s = "leetcode"
return 0.

s = "loveleetcode",
return 2.

Note: You may assume the string contain only lowercase letters.

Related Topics: Hash Table, String

原问题: 387. First Unique Character in a String

中文翻译版: 387. 字符串中的第一个唯一字符

解决方案

本题需要找到第一个不重复出现的字符的位置,因此在遍历字符串时需要保存字符的位置。由于字符串中只有小写字母,因此可以创建一个大小为 26 的数组,存储 26 个字母在字符串中出现位置,由于还需要判断该字母是否重复出现,这时可以用一个特殊值表示该字母在字符串中重复出现。数组元素的值分三种情况:

  • -1 表示该字母已经重复出现
  • 0 表示该字母未在字符串中出现
  • 正整数表示该字母第一次在字符串中出现的位置 (位置从 1 开始)

根据输入的字符串构造出这个数组后,我们可以很快找到第一个不重复出现的字符的位置 (即数组元素中值最小的那个,不包括 -1 和 0)。实现代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
#include <iostream>
#include <string>
#include <cstring>
using namespace std;

class Solution {
public:
int firstUniqChar(string s)
{
if (s.size() == 0)
return -1;
else if (s.size() == 1)
return 0;

/* letters[i]
* -1 means letter repeat
* 0 means letter not appear
* > 0 means position of letter in string (index starts from 1)
*/
int letters[26];
int pos;

memset(letters, 0, sizeof(letters));
for (int i=0; i<s.size(); i++) {
pos = s[i] - 'a';
if (letters[pos] > 0)
letters[pos] = -1;
else if (letters[pos] == 0)
letters[pos] = i + 1;
}

int ret = -1;
for (int i=0; i<26; i++) {
if (letters[i] > 0) {
if (ret == -1)
ret = letters[i] - 1;
else if (ret > letters[i] - 1)
ret = letters[i] - 1;
}
}

return ret;
}
};

int main()
{
Solution solu;

string str1 = "aadadaad";
cout << str1 << ": " << solu.firstUniqChar(str1) << endl;

string str2 = "loveleetcode";
cout << str2 << ": " << solu.firstUniqChar(str2) << endl;

return 0;
}

介绍

对于K均值算法中聚类个数K的选择,通常有四种方法:

  • 按需选择
  • 观察法
  • 手肘法
  • Gap Statistic方法

本文相关算法的代码以及实验在:GitHub

按需选择

简单地说就是按照建模的需求和目的来选择聚类的个数。比如说,一个游戏公司想把所有玩家做聚类分析,分成顶级、高级、中级、菜鸟四类,那么K=4;如果房地产公司想把当地的商品房分成高中低三档,那么K=3。按需选择法虽然合理,但是未必能保证在做K-Means时能够得到清晰的分界线。

观察法

观察法就是将数据可视化,然后用肉眼去观察这些数据点大概分成几堆。这个方法有个缺点,那就是数据维度不能太高,一般是二维或者三维,否则数据无法可视化出来。

手肘法

手肘法 (Elbow Method) 本质上也是一种间接的观察法。当我们对数据 ${ x_1, \ldots, x_N }$ 进行K均值聚类后,我们将得到 $K$ 个聚类的中心点 $\mu_k, k=1,2, \ldots, K$,以及数据点 $x_i$ 所对应的簇 $C_k, k=1, 2, \ldots, K$,每个簇中有 $n_k$ 个数据点。我们定义每个簇中的所有点相互之间的距离的和为
$$
D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2
$$
其中 $\Vert \cdot \Vert$ 为2范数。则对于聚类个数为 $K$ 时,我们可以定义一个测度量为
$$
W_K = \sum_{k=1}^K \frac{1}{2n_k} D_k
$$
对于不同的 $K$,经过 K-means 算法后我们会得到不同的中心点和数据点所属的簇,从而得到不同的度量 $W_K$。将聚类个数 $K$ 作为横坐标,$W_K$ 作为纵坐标,我们可以得到类似下面的图像:

图像中图形很像人的手肘,该方法的命名就是从这而来。从图像中我们可以观察到,$K = 1$ 到 $4$ 下降很快,$K = 4$ 之后趋于平稳,因此 $K = 4$ 是一个拐点,手肘法认为这个拐点就是最佳的聚类个数 $K$。

手肘法是一个经验方法,观察结果因人而异,特别是遇到拐点位置不是很明显时。相比于前面的观察法,该方法的优点在于其适用于高维度的数据。

Trick

如果每个簇的中心点为该簇中的所有数据点的平均值时,即
$$
\mu_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i
$$
则 $D_k$ 可以写为
$$
D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 = 2n_k \sum_{x_i \in C_k} \Vert x_i - \mu_k \Vert^2
$$
$W_K$ 写为
$$
W_K = \sum_{k=1}^K \frac{1}{2n_k} D_k = \sum_{k=1}^K \sum_{x_i \in C_k} \Vert x_i - \mu_k \Vert^2
$$

Note: 以下为公式推导内容,可略过。将 $D_k$ 表达式分解可得:
$$
\begin{align}
& \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 \
= & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert (x_i - \mu_k) - (x_j - \mu_k) \Vert^2 \
= & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left( \Vert x_i - \mu_k \Vert^2 + \Vert x_j - \mu_k \Vert^2 - 2 \left< x_i - \mu_k, x_j - \mu_k \right> \right) \
= & \sum_{x_i \in C_k} n_k \Vert x_i - \mu_k \Vert^2 + \sum_{x_j \in C_k} n_k \Vert x_j - \mu_k \Vert^2 - 2 \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> \
= & 2 \sum_{x_i \in C_k} n_k \Vert x_i - \mu_k \Vert^2 - 2 \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right>
\end{align}
$$
其中
$$
\begin{align}
& \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> \
= & \sum_{x_i \in C_k} \left< x_i - \mu_k, \sum_{x_j \in C_k} (x_j - \mu_k) \right>
\end{align}
$$
由于
$$
\sum_{x_j \in C_k} (x_j - \mu_k) = \sum_{x_j \in C_k} x_j - n_k \mu_k = ; n_k \mu_k - n_k \mu_k = 0
$$

$$
\sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> = 0
$$
因此可以得到
$$
\sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 = 2 n_k \sum_{x_i \in C_k} \Vert x_i - \mu_k \Vert^2
$$

Gap Statistic方法

这个方法来源于论文[3],详细内容可以去阅读该论文。延用前面关于度量 $W_K$ 的计算方式,该算法的流程如下:

  1. cluster the observed data, varying the total number of clusters from $K=1, 2, \ldots, K_{\text{max}}$, and giving within-dispersion measures $W_K, K=1, 2, \ldots, K_{\text{max}}$

  2. generate $B$ reference data sets and cluster each one giving within-dispersion measures $W_{Kb}$, $b=1,2, \ldots, B$, $K=1, 2, \ldots, K_{\text{max}}$. Compute the gap statistic
    $$
    \text{Gap}(K) =\frac{1}{B} \sum_{b=1}^B \log(W_{Kb}) - \log(W_K)
    $$

  3. let $\bar{w} = \frac{1}{B} \sum_{b=1}^B \log(W_{Kb})$, compute the standard deviation
    $$
    \text{sd}K = \left[ \frac{1}{B} \sum{b=1}^B (\log(W_{Kb}) - \bar{w})^2 \right]^{\frac{1}{2}}
    $$
    and define $s_K = \text{sd}_K \sqrt{1 + \frac{1}{B}}$

  4. choose the number of clusters as the smallest $K$ such that $\text{Gap(K)} \ge \text{Gap(K+1)} - s_{K+1}$

关于 reference data sets 的生成,论文给了两种生成方法,这里只用其中一个,即对于生成的样本 $x_r$,其维度为 $d$,该样本的元素 $x_{ri}$ 的值按照均匀分布 $U(a, b)$ 随机生成,其中 $a$ 为待聚类的所有数据点的第 $i$ 位置元素的最小值, $b$ 为数据点的第 $i$ 位置元素的最大值。经过实验发现,这个算法的 3,4 步骤可以去掉,只用第 2 步计算出的 $\text{Gap}(K)$,然后选择最大 $\text{Gap}(K)$ 所对应的 $K$ 作为最优的聚类个数。这里贴出从一个数据集中得到的 $\text{Gap}(K)$ 的变化曲线图

可以发现,当 $K = 4$ 时,$\text{Gap}(K)$ 取得最大值,因此选择 $K = 4$ 作为最优的聚类个数。该算法实现的代码为:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
import numpy as np


def calculate_Wk(data, centroids, cluster):
K = centroids.shape[0]
wk = 0.0
for k in range(K):
data_in_cluster = data[cluster == k, :]
center = centroids[k, :]
num_points = data_in_cluster.shape[0]
for i in range(num_points):
wk = wk + np.linalg.norm(data_in_cluster[i, :]-center, ord=2) ** 2

return wk


def bounding_box(data):
dim = data.shape[1]
boxes = []
for i in range(dim):
data_min = np.amin(data[:, i])
data_max = np.amax(data[:, i])
boxes.append((data_min, data_max))

return boxes


def gap_statistic(data, max_K, B, cluster_algorithm):
num_points, dim = data.shape
K_range = np.arange(1, max_K, dtype=int)
num_K = len(K_range)
boxes = bounding_box(data)
data_generate = np.zeros((num_points, dim))

''' 写法1
log_Wks = np.zeros(num_K)
gaps = np.zeros(num_K)
sks = np.zeros(num_K)
for ind_K, K in enumerate(K_range):
cluster_centers, labels, _ = cluster_algorithm(data, K)
log_Wks[ind_K] = np.log(calculate_Wk(data, cluster_centers, labels))

# generate B reference data sets
log_Wkbs = np.zeros(B)
for b in range(B):
for i in range(num_points):
for j in range(dim):
data_generate[i][j] = \
np.random.uniform(boxes[j][0], boxes[j][1])
cluster_centers, labels, _ = cluster_algorithm(data_generate, K)
log_Wkbs[b] = \
np.log(calculate_Wk(data_generate, cluster_centers, labels))
gaps[ind_K] = np.mean(log_Wkbs) - log_Wks[ind_K]
sks[ind_K] = np.std(log_Wkbs) * np.sqrt(1 + 1.0 / B)
'''

''' 写法2
'''
log_Wks = np.zeros(num_K)
for indK, K in enumerate(K_range):
cluster_centers, labels, _ = cluster_algorithm(data, K)
log_Wks[indK] = np.log(calculate_Wk(data, cluster_centers, labels))

gaps = np.zeros(num_K)
sks = np.zeros(num_K)
log_Wkbs = np.zeros((B, num_K))

# generate B reference data sets
for b in range(B):
for i in range(num_points):
for j in range(dim):
data_generate[i, j] = \
np.random.uniform(boxes[j][0], boxes[j][1])
for indK, K in enumerate(K_range):
cluster_centers, labels, _ = cluster_algorithm(data_generate, K)
log_Wkbs[b, indK] = \
np.log(calculate_Wk(data_generate, cluster_centers, labels))

for k in range(num_K):
gaps[k] = np.mean(log_Wkbs[:, k]) - log_Wks[k]
sks[k] = np.std(log_Wkbs[:, k]) * np.sqrt(1 + 1.0 / B)

return gaps, sks, log_Wks

参考

  1. K-means怎么选K?
  2. k-means的k值该如何确定?
  3. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2001, 63(2): 411-423.

问题描述

Given a paragraph and a list of banned words, return the most frequent word that is not in the list of banned words. It is guaranteed there is at least one word that isn’t banned, and that the answer is unique.

Words in the list of banned words are given in lowercase, and free of punctuation. Words in the paragraph are not case sensitive. The answer is in lowercase.

Example:

1
2
3
4
5
6
7
8
9
10
11
12
Input:
paragraph = "Bob hit a ball, the hit BALL flew far after it was hit."
banned = ["hit"]

Output: "ball"

Explanation:
"hit" occurs 3 times, but it is a banned word.
"ball" occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph.
Note that words in the paragraph are not case sensitive,
that punctuation is ignored (even if adjacent to words, such as "ball,"),
and that "hit" isn't the answer even though it occurs more because it is banned.

Note:

  • 1 <= paragraph.length <= 1000.
  • 1 <= banned.length <= 100.
  • 1 <= banned[i].length <= 10.
  • The answer is unique, and written in lowercase (even if its occurrences in paragraph may have uppercase symbols, and even if it is a proper noun.)
  • paragraph only consists of letters, spaces, or the punctuation symbols !?',;.
  • Different words in paragraph are always separated by a space.
  • There are no hyphens or hyphenated words.
  • Words only consist of letters, never apostrophes or other punctuation symbols.

Related Topics: String

原问题: 819. Most Common Word

中文翻译版: 819. Most Common Word

解决方案

题目要求在字符串 paragraph 中找到出现次数最多的单词,并且单词不能是列表 banned 中的单词。解决该题可以设定两个哈希表,第一个哈希表的 key 为列表 banned 中的单词,用于判断单词是否在列表中,第二个哈希表的 key 为字符串 paragraph 中的单词,value 为该单词的出现次数。依次取出字符串 paragraph 中的单词,由于题目要求单词都是小写,所以要将单词转换成小写,然后在第一个哈希表中查找该单词是否存在,如果不存在,将第二个哈希表中该单词的出现次数加1,否则取字符串下一个单词,依次重复这个操作直到遍历完字符串 paragraph 的所有单词,接着根据第二个哈希表找出出现次数最多的单词。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <unordered_set>
#include <unordered_map>
using namespace std;

class Solution {
public:
bool delimter(char ch)
{
if (ch == ' ' || ch == '!' ||
ch == '?' || ch == '\'' ||
ch == ',' || ch == ';' || ch == '.')
return true;
else
return false;
}

void lowercase(string& str)
{
transform(str.begin(), str.end(), str.begin(), ::tolower);
}

string mostCommonWord(string paragraph, vector<string>& banned)
{
unordered_set<string> banned_list;
for (int i=0; i<banned.size(); i++)
banned_list.insert(banned[i]);

int times, max_times;
string str, common_str;
int start, end;
unordered_map<string, int> word_counts;

start = 0;
max_times = 0;
common_str = "";
for (int i=0; i<paragraph.size(); i++) {
if (delimter(paragraph[i])) {
end = i;

str = string(paragraph, start, end-start);
lowercase(str);

if (banned_list.find(str) == banned_list.end()) {
auto it = word_counts.find(str);
if (it == word_counts.end()) {
word_counts.insert(make_pair(str, 1));
times = 1;
} else {
it->second = it->second + 1;
times = it->second;
}
if (times > max_times) {
common_str = str;
max_times = times;
}
}

while (i<paragraph.size() && delimter(paragraph[i]))
i++;

start = i;
}
}

// the last word
end = paragraph.size();
if (start < end) {
str = string(paragraph, start, end-start);
lowercase(str);
if (banned_list.find(str) == banned_list.end()) {
auto it = word_counts.find(str);
if (it == word_counts.end()) {
word_counts.insert(make_pair(str, 1));
times = 1;
} else {
it->second = it->second + 1;
times = it->second;
}
if (times > max_times) {
common_str = str;
max_times = times;
}
}
}

return common_str;
}
};

int main()
{
Solution solu;
string paragraph = "Bob hit a ball, the hit BALL flew far after it was hit.";
vector<string> banned = {"hit"};

cout << solu.mostCommonWord(paragraph, banned) << endl;

return 0;
}

问题描述

Initially, there is a Robot at position (0, 0). Given a sequence of its moves, judge if this robot makes a circle, which means it moves back to the original place.

The move sequence is represented by a string. And each move is represent by a character. The valid robot moves are R (Right), L (Left), U (Up) and D (down). The output should be true or false representing whether the robot makes a circle.

Example 1:

1
2
Input: "UD"
Output: true

Example 2:

1
2
Input: "LL"
Output: false

Related Topics: String

原问题: 657. Judge Route Circle

中文翻译版: 657. 判断路线成圈

解决方案

根据题目描述,可以设定一个二维坐标 (x, y),根据指令做相应的改变:

  • Right: x = x + 1
  • Left: x = x - 1
  • Up: y = y + 1
  • Down: y = y - 1

最后判断二维坐标 (x, y) 是否为原点 (0, 0) 即可。实现代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
#include <iostream>
using namespace std;

class Solution {
public:
bool judgeCircle(string moves)
{
int x = 0, y = 0;

for (int i=0; i<moves.size(); i++) {
switch (moves[i]) {
case 'U':
y++;
break;
case 'D':
y--;
break;
case 'L':
x--;
break;
case 'R':
x++;
break;
}
}

return (x == 0 && y == 0);
}
};

int main()
{
Solution solu;
string str1 = "UD";
cout << str1 << ": " << solu.judgeCircle(str1) << endl;
string str2 = "LL";
cout << str2 << ": " << solu.judgeCircle(str2) << endl;

return 0;
}

问题描述

Roman numerals are represented by seven different symbols: I, V, X, L, C, D and M.

1
2
3
4
5
6
7
8
Symbol       Value
I 1
V 5
X 10
L 50
C 100
D 500
M 1000

For example, two is written as II in Roman numeral, just two one’s added together. Twelve is written as, XII, which is simply X + II. The number twenty seven is written as XXVII, which is XX + V + II.

Roman numerals are usually written largest to smallest from left to right. However, the numeral for four is not IIII. Instead, the number four is written as IV. Because the one is before the five we subtract it making four. The same principle applies to the number nine, which is written as IX. There are six instances where subtraction is used:

  • I can be placed before V (5) and X (10) to make 4 and 9.
  • X can be placed before L (50) and C (100) to make 40 and 90.
  • C can be placed before D (500) and M (1000) to make 400 and 900.

Given a roman numeral, convert it to an integer. Input is guaranteed to be within the range from 1 to 3999.

Example 1:

1
2
Input: "III"
Output: 3

Example 2:

1
2
Input: "IV"
Output: 4

Example 3:

1
2
Input: "IX"
Output: 9

Example 4:

1
2
3
Input: "LVIII"
Output: 58
Explanation: C = 100, L = 50, XXX = 30 and III = 3.

Example 5:

1
2
3
Input: "MCMXCIV"
Output: 1994
Explanation: M = 1000, CM = 900, XC = 90 and IV = 4.

Related Topics: Math, String

原问题: 13. Roman to Integer

中文翻译版: 13. 罗马数字转整数

解决方案

题目难度不大,根据题目描述可以写出代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#include <iostream>
#include <string>
using namespace std;

class Solution {
public:
int romanToInt(string s)
{
int num = 0;
int j;
char ch;

for (int i=0; i<s.size(); i++) {
ch = s[i];
j = i + 1;
switch (ch) {
case 'I': // value 1
if (j < s.size() && (s[j] == 'V' || s[j] == 'X'))
num -= 1;
else
num += 1;
break;
case 'V': // value 5
num += 5;
break;
case 'X': // value 10
if (j < s.size() && (s[j] == 'L' || s[j] == 'C'))
num -= 10;
else
num += 10;
break;
case 'L': // value 50
num += 50;
break;
case 'C': // value 100
if (j < s.size() && (s[j] == 'D' || s[j] == 'M'))
num -= 100;
else
num += 100;
break;
case 'D': // value 500
num += 500;
break;
case 'M': // value 1000
num += 1000;
break;
}
}

return num;
}
};

int main()
{
Solution solu;
string str1 = "MCMXCIV";

cout << str1 << ": " << solu.romanToInt(str1) << endl;

return 0;
}