袋熊的树洞

Matplotlib Cheat Sheet

Posted on 2018-05-09 Edited on 2025-03-02

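Introduction

This post records some Matplotlib usage notes and pitfalls.

Note: unless stated otherwise, plt refers to matplotlib.pyplot and np refers to numpy.

import numpy as np
import matplotlib.pyplot as plt

The code in this post was mostly run in a Jupyter notebook, with the following magic command executed before plotting:

%matplotlib notebook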

Creating multiple subplots

A common task is drawing several plots in one figure, using either the add_subplot method or the subplots function.

Using add_subplot

fig = plt.figure()

ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

ax1.plot(np.random.randn(50).cumsum(), 'k--')
ax2.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)

Using subplots

fig, axes = plt.subplots(1, 2)

ax1 = axes[0]
ax2 = axes[1]

ax1.plot(np.random.randn(50).cumsum(), 'k--')
ax2.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)

Saving a figure to a file

A figure can be saved to a file with plt.savefig, or with the savefig method of the Figure object.

data = np.arange(10)
fig, ax = plt.subplots(1, 1)
ax.plot(data)

plt.savefig('1.png')
fig.savefig('2.png')
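
A minimal sketch of two commonly used savefig options, dpi and bbox_inches. It assumes the non-interactive Agg backend and writes into a temporary directory; the file name line.png and the dpi value are illustrative choices, not taken from the original post.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
ax.plot(np.arange(10))

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "line.png")  # hypothetical file name

# dpi sets the resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)

saved_ok = os.path.isfile(out_path) and os.path.getsize(out_path) > 0
print(saved_ok)
```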

Specific plot types

Histogram

A histogram is drawn with the hist function; the bins option sets the number of bins.

mu, sigma = 100, 15
num_bins = 100
x = mu + sigma * np.random.randn(1000000)

n, bins, patches = plt.hist(x, num_bins, density=True)

plt.title('Histogram')
plt.grid(True)
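
To make the effect of density=True concrete, the sketch below uses np.histogram, which returns the same kind of counts and bin edges as plt.hist, and checks that the bar areas sum to 1. The seed and the sample size are arbitrary choices for repeatability.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
x = 100 + 15 * rng.standard_normal(10_000)

# np.histogram returns (values, bin edges), like the first two results of plt.hist
n, bins = np.histogram(x, bins=100, density=True)
widths = np.diff(bins)

# With density=True, each value is a density, so the areas sum to 1
total_area = float(np.sum(n * widths))
print(round(total_area, 6))
```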

Pandas Cheat Sheet

Posted on 2018-05-06 Edited on 2025-03-02

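1. Pandas Cheat Sheet series

Pandas Cheat Sheet 2

2. Introduction

This post records some Pandas usage notes and pitfalls.

Note: unless stated otherwise, pd refers to pandas and np refers to numpy.

import pandas as pd
import numpy as np

3. Building a DataFrame

There are many ways to build a DataFrame. This section covers two common ones:

From dict of ndarrays / lists
From a list of dicts

3.1. From dict of ndarrays / lists

The input is a dict whose keys are column names and whose values are NumPy arrays or lists, each holding the values of one DataFrame column.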

>>> data = {'one': [1, 2, 3],
...         'two': [3, 2, 1]}
>>> pd.DataFrame(data)
   one  two
0    1    3
1    2    2
2    3    1

3.2. From a list of dicts

The input is a list in which each element is a dict holding the values of one DataFrame row: each key is a column name and each value is a single DataFrame element.

>>> data = [{'one': 1, 'two': 3},
...         {'one': 2, 'two': 2},
...         {'one': 3, 'two': 1}]
>>> pd.DataFrame(data)
   one  two
0    1    3
1    2    2
2    3    1

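4. Series indexing

Series indexing is similar to NumPy array indexing, except that index labels can be used in addition to integers. Recent pandas versions deprecate integer positional access through [] on a label-indexed Series, so obj.iloc[1] is the safer spelling of obj[1].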

>>> obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
>>> obj
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
>>> obj['b']
1.0
>>> obj[1]
1.0
>>> obj[:2]
a    0.0
b    1.0
dtype: float64
>>> obj[obj < 2]
a    0.0
b    1.0
dtype: float64

# Label-based slices include both endpoints
>>> obj['b':'c']
b    1.0
c    2.0
dtype: float64
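
The difference between label-based and position-based access can be made explicit with .loc and .iloc. The following sketch rebuilds the same Series and compares the two forms of slicing.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj.loc["b"]         # label-based lookup
by_position = obj.iloc[1]       # position-based lookup

label_slice = obj.loc["b":"c"]  # label slice: both endpoints included
pos_slice = obj.iloc[1:2]       # position slice: end excluded

print(by_label, by_position)
print(list(label_slice.index), list(pos_slice.index))
```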

5. DataFrame indexing

The table below summarizes the DataFrame indexing options, taken from "Python for Data Analysis".

Type	Notes
df[val]	Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
df.loc[val]	Selects single row or subset of rows from the DataFrame by label
df.loc[:, val]	Selects single column or subset of columns by label
df.loc[val1, val2]	Select both rows and columns by label
df.iloc[where]	Selects single row or subset of rows from the DataFrame by integer position
df.iloc[:, where]	Selects single column or subset of columns by integer position
df.iloc[where_i, where_j]	Select both rows and columns by integer position
df.at[label_i, label_j]	Select a single scalar value by row and column label
df.iat[i, j]	Select a single scalar value by row and column position (integers)

First, create some data:

>>> data = pd.DataFrame(np.arange(16).reshape((4, 4)),
...                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
...                     columns=['one', 'two', 'three', 'four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

5.1. The df[] syntax

# Select a single column with one label
>>> data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64


# Select several columns with a list of labels; the column order can be changed
>>> data[['three', 'one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


# Select rows with a boolean array
>>> bools = np.array([False, True, False, True])
>>> bools
array([False,  True, False,  True])
>>> data[bools]
          one  two  three  four
Colorado    4    5      6     7
New York   12   13     14    15


# Select rows with a slice, similar to NumPy syntax
>>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


# Select data with a boolean DataFrame
>>> data < 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
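>>> masked = data.copy()  # work on a copy so the later examples still see the original values
>>> masked[masked < 5] = 0
>>> masked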
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

5.2. The df.loc syntax

df.loc[] indexes by axis labels.

# Select rows with df.loc[val]
>>> data.loc['Utah']
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
>>> data.loc[['Utah', 'Ohio']]
      one  two  three  four
Utah    8    9     10    11
Ohio    0    1      2     3
# Label-based slices include both endpoints
>>> data.loc[:'Utah']
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11


# Select columns with df.loc[:, val]
>>> data.loc[:, 'one']
Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int64
>>> data.loc[:, ['one', 'two']]
          one  two
Ohio        0    1
Colorado    4    5
Utah        8    9
New York   12   13
# Label-based slices include both endpoints
>>> data.loc[:, :'two']
          one  two
Ohio        0    1
Colorado    4    5
Utah        8    9
New York   12   13


# Select rows and columns with df.loc[val1, val2]
>>> data.loc[['Colorado', 'Ohio'], ['two', 'three']]
          two  three
Colorado    5      6
Ohio        1      2

5.3. The df.iloc syntax

df.iloc[] indexes by integer position.

# Select rows with df.iloc[where]
>>> data.iloc[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
>>> data.iloc[[2,1]]
          one  two  three  four
Utah        8    9     10    11
Colorado    4    5      6     7
>>> data.iloc[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


# Select columns with df.iloc[:, where]
>>> data.iloc[:, 1]
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
>>> data.iloc[:, [2, 0]]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
>>> data.iloc[:, :2]
          one  two
Ohio        0    1
Colorado    4    5
Utah        8    9
New York   12   13


# Select rows and columns with df.iloc[where_i, where_j]
>>> data.iloc[2, :2]
one    8
two    9
Name: Utah, dtype: int64
>>> data.iloc[:2, :2]
          one  two
Ohio        0    1
Colorado    4    5

5.4. Reset Index

DataFrame's reset_index() function turns the index into a regular column and replaces it with an increasing integer index.

>>> data.reset_index()
      index  one  two  three  four
0      Ohio    0    1      2     3
1  Colorado    4    5      6     7
2      Utah    8    9     10    11
3  New York   12   13     14    15

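Another use of this function is numbering the rows: if the index is already an increasing integer index, calling reset_index() again adds a column of increasing integers, effectively giving each row a number.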

>>> data2 = data.reset_index()
>>> data2
      index  one  two  three  four
0      Ohio    0    1      2     3
1  Colorado    4    5      6     7
2      Utah    8    9     10    11
3  New York   12   13     14    15
>>> data2.reset_index()
   level_0     index  one  two  three  four
0        0      Ohio    0    1      2     3
1        1  Colorado    4    5      6     7
2        2      Utah    8    9     10    11
3        3  New York   12   13     14    15

The level_0 column can serve as a row number.
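
Two variations on reset_index() that often come up are sketched below: drop=True discards the old index instead of keeping it as a column, and rename() gives the generated index column a more descriptive name. The column name state and the smaller two-column frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(8).reshape((4, 2)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two"])

# drop=True throws the old index away instead of turning it into a column
dropped = data.reset_index(drop=True)

# Rename the generated "index" column ("state" is a hypothetical name)
named = data.reset_index().rename(columns={"index": "state"})

print(list(dropped.columns), list(dropped.index))
print(list(named.columns))
```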

6. Mapping Series values

Given the data:

>>> data = pd.DataFrame({'Age': [22, 38, 26, 35, 35],
...                      'Sex': ['male', 'female', 'female', 'female', 'male']})
>>> data
   Age     Sex
0   22    male
1   38  female
2   26  female
3   35  female
4   35    male

The Sex column takes two values, female and male. Suppose we want the following mapping:

female -> 0
male -> 1

This can be done with the map() function, passing the mapping as a dict:

>>> sex_to_num = {'female': 0, 'male': 1}
>>> data['Sex'] = data['Sex'].map(sex_to_num)
>>> data
   Age  Sex
0   22    1
1   38    0
2   26    0
3   35    0
4   35    1
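
One thing to watch with map() and a dict is that values without an entry in the dict become NaN. The sketch below shows this with a made-up "unknown" value and then fills the gap with -1, an arbitrary placeholder chosen only for illustration.

```python
import pandas as pd

sex = pd.Series(["male", "female", "unknown"])  # "unknown" is a made-up value
sex_to_num = {"female": 0, "male": 1}

mapped = sex.map(sex_to_num)            # "unknown" has no entry, so it becomes NaN
filled = mapped.fillna(-1).astype(int)  # -1 is an arbitrary placeholder

print(mapped.isna().tolist())
print(filled.tolist())
```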

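7. Missing values

Note: in this section, NA is defined by

from numpy import nan as NA

7.1. Inspecting missing values

Start by inspecting where the data has missing values.

Series

For a Series, isnull() reports whether each value is missing: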

>>> data = pd.Series([1, NA, 3.5, NA, 7])
>>> data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
>>> data.isnull()
0    False
1     True
2    False
3     True
4    False
dtype: bool

To count the missing values, apply sum() to the Boolean Series returned by isnull(), which counts the True entries.

>>> data.isnull().sum()
2

DataFrame

For a DataFrame, isnull() likewise reports whether each value is missing:

>>> data = pd.DataFrame([[1., 6.5, 3.], [1, NA, NA],
...                      [NA, NA, NA], [NA, 6.5, 3.]])
>>> data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> data.isnull()
       0      1      2
0  False  False  False
1  False   True   True
2   True   True   True
3   True  False  False

To find the number of missing values in each column, use info(), or combine isnull() with sum():

>>> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
0    2 non-null float64
1    2 non-null float64
2    2 non-null float64
dtypes: float64(3)
memory usage: 176.0 bytes
>>> data.isnull().sum()
0    2
1    2
2    2
dtype: int64

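To get the total number of missing values, apply sum() twice:

>>> data.isnull().sum().sum()
6

7.2. Filling missing values

One way to handle missing values is to fill them, either by interpolation or with chosen values. First build a DataFrame with missing values: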

>>> df = pd.DataFrame(np.random.randn(7, 3))
>>> df.iloc[:4, 1] = NA
>>> df.iloc[:2, 2] = NA
>>> df
          0         1         2
0  1.309480       NaN       NaN
1 -0.542253       NaN       NaN
2 -1.129503       NaN -0.396921
3  1.172466       NaN -1.114420
4  0.893893  0.572520  0.648532
5  0.293066  1.875362 -0.426759
6 -0.389516  0.379184  0.759198

Filling with a chosen value

This uses fillna(): pass a value and every missing entry is set to it.

# Fill missing values with 0
>>> df.fillna(0)
          0         1         2
0  1.309480  0.000000  0.000000
1 -0.542253  0.000000  0.000000
2 -1.129503  0.000000 -0.396921
3  1.172466  0.000000 -1.114420
4  0.893893  0.572520  0.648532
5  0.293066  1.875362 -0.426759
6 -0.389516  0.379184  0.759198

Passing a dict sets a different fill value for each column:

# Fill missing values in the second column with 0.5 and in the third column with 0
>>> df.fillna({1: 0.5, 2: 0})
          0         1         2
0  1.309480  0.500000  0.000000
1 -0.542253  0.500000  0.000000
2 -1.129503  0.500000 -0.396921
3  1.172466  0.500000 -1.114420
4  0.893893  0.572520  0.648532
5  0.293066  1.875362 -0.426759
6 -0.389516  0.379184  0.759198

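Filling by interpolation

This uses the method parameter of fillna(). Because some columns in the example are missing their first elements, backfill is chosen, which propagates the next valid value backwards. (pandas 2.1 deprecated the method parameter; df.bfill() gives the same result.)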

>>> df.fillna(method='backfill')
          0         1         2
0  1.309480  0.572520 -0.396921
1 -0.542253  0.572520 -0.396921
2 -1.129503  0.572520 -0.396921
3  1.172466  0.572520 -1.114420
4  0.893893  0.572520  0.648532
5  0.293066  1.875362 -0.426759
6 -0.389516  0.379184  0.759198
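
For comparison, the sketch below applies backward fill, forward fill and linear interpolation to one small Series. The values are made up so that the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

back = s.bfill()          # next valid value, propagated backwards
forward = s.ffill()       # previous valid value; the leading NaN stays
linear = s.interpolate()  # linear interpolation between valid neighbours

print(back.tolist())
print(forward.tolist())
print(linear.tolist())
```

Backward fill is the only one of the three that fills a gap at the start of a column, which matches the choice made in the example above.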

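7.3. Dropping missing values

Dropping rows with missing values

By default, DataFrame's dropna function drops every row that contains a missing value.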

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

Dropping columns with missing values

Setting dropna's axis parameter to 1 drops the columns that contain missing values.

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
>>> df.dropna(axis=1)
       name
0    Alfred
1    Batman
2  Catwoman
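
dropna also accepts the subset, how and thresh parameters for finer control. The sketch below applies each of them to the same DataFrame; the thresholds are chosen only to make the differences visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                   "toy": [np.nan, "Batmobile", "Bullwhip"],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

by_subset = df.dropna(subset=["toy"])  # only the "toy" column is checked
all_missing = df.dropna(how="all")     # drop rows where every value is missing
at_least_two = df.dropna(thresh=2)     # keep rows with at least 2 non-missing values

print(list(by_subset.index), list(all_missing.index), list(at_least_two.index))
```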

8. Duplicates

Given the DataFrame:

>>> data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
...                      'k2': [1, 1, 2, 3, 3, 4, 4]})
>>> data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

8.1. Dropping duplicates

The drop_duplicates() function removes duplicate rows:

>>> data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

drop_duplicates() considers all columns by default; a subset of columns can also be given to decide what counts as a duplicate:

>>> data.drop_duplicates(['k1'])
    k1  k2
0  one   1
1  two   1
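
Two related tools are sketched below on the same data: duplicated() returns the Boolean mask that drop_duplicates() acts on, and keep="last" keeps the last occurrence of each duplicate instead of the first.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

# True marks rows that repeat an earlier row
dup_mask = data.duplicated()

# Judge duplicates by k1 only, keeping the last occurrence of each value
last_kept = data.drop_duplicates(["k1"], keep="last")

print(dup_mask.tolist())
print(list(last_kept.index), last_kept["k2"].tolist())
```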

9. Sorting

Given the DataFrame:

>>> df = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
>>> df
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

9.1. Sorting by columns

Sorting uses sort_values(); to sort by particular columns, set the by= parameter:

>>> df.sort_values(by='a')
   a  b
0  0  4
2  0 -3
1  1  7
3  1  2
>>> df.sort_values(by=['a', 'b'])
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
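
When sorting by several columns, the ascending parameter accepts a list so that each column gets its own direction. A short sketch on the same DataFrame, sorting a ascending and b descending:

```python
import pandas as pd

df = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# One direction per column: a ascending, b descending
result = df.sort_values(by=["a", "b"], ascending=[True, False])

print(list(result.index))
print(result["b"].tolist())
```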

10. GroupBy

Given the DataFrame:

>>> df = pd.DataFrame({'key1':['a','a','b','b','a'],
...                    'key2':['one','two','one','two','one'],
...                    'data1':np.random.randn(5),
...                    'data2':np.random.randn(5)})
>>> df
      data1     data2 key1 key2
0  2.462027  0.054159    a  one
1  0.283423 -0.658160    a  two
2 -0.969307 -0.407126    b  one
3 -0.636756  1.925338    b  two
4 -0.408266  1.833710    a  one

10.1. Listing group names

After grouping a DataFrame, the name of each group can be obtained with:

>>> df.groupby('key1').groups
{'a': Int64Index([0, 1, 4], dtype='int64'), 'b': Int64Index([2, 3], dtype='int64')}
>>> df.groupby('key1').groups.keys()
dict_keys(['a', 'b'])

10.2. Group-wise sum and mean

To compute the sum of data1 grouped by the values of column key1:

>>> df['data1'].groupby(df['key1']).sum().reset_index()
  key1     data1
0    a  2.337185
1    b -1.606063

Or:

>>> df.filter(['data1', 'key1']).groupby('key1', as_index=False).sum()
  key1     data1
0    a  2.337185
1    b -1.606063

The as_index=False argument keeps the values of key1 from becoming the index. To compute the mean, replace sum() with mean().
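
If both the sum and the mean are needed, agg() can compute them in one pass. The sketch below uses fixed numbers instead of random data so that the result can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                   "data1": [1.0, 2.0, 3.0, 4.0, 6.0]})

# Several aggregations at once; reset_index() turns key1 back into a column
stats = df.groupby("key1")["data1"].agg(["sum", "mean"]).reset_index()

print(stats)
```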

11. Merge

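11.1. Cartesian product

The Cartesian product of two sets $X$ and $Y$, written $X \times Y$, is the set of all ordered pairs whose first element is a member of $X$ and whose second element is a member of $Y$. For example, if $A = \{a, b\}$ and $B = \{0, 1, 2\}$, then their Cartesian product is $\{(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)\}$. The Pandas merge function can be used to build the Cartesian product of two DataFrames.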

>>> df1 = pd.DataFrame({'A': ['a', 'b', 'c']})
>>> df1
   A
0  a
1  b
2  c
>>> df2 = pd.DataFrame({'B': [0, 1, 2]})
>>> df2
   B
0  0
1  1
2  2
>>> pd.merge(df1.assign(foo=0), df2.assign(foo=0), on=['foo']).drop(columns=['foo'])
   A  B
0  a  0
1  a  1
2  a  2
3  b  0
4  b  1
5  b  2
6  c  0
7  c  1
8  c  2

The idea is to add a helper column foo, set to 0, to both DataFrames, merge on that column, and then drop it.
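
As an alternative sketch, pandas 1.2 and later support how="cross" in merge, which builds the Cartesian product without a helper column. This is a different technique from the foo-column approach above and assumes a sufficiently recent pandas.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["a", "b", "c"]})
df2 = pd.DataFrame({"B": [0, 1, 2]})

# how="cross" pairs every row of df1 with every row of df2 (pandas >= 1.2)
cross = pd.merge(df1, df2, how="cross")

print(cross.shape)
print(cross.head(3))
```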

12. Reading data

12.1. Reading a csv file

Reading a csv data file uses the read_csv() function.
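
A minimal sketch of read_csv(), reading from an in-memory string so that it runs without a file on disk. The CSV content and column names are made up for illustration; with a real file, a path would be passed in place of the StringIO object.

```python
import io

import pandas as pd

csv_text = "id,label\n1,frog\n2,truck\n"  # made-up sample content

# Default behaviour: the first line is the header, the index is 0..n-1
df = pd.read_csv(io.StringIO(csv_text))

# index_col uses a column of the file as the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(df)
print(indexed.loc[2, "label"])
```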

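NumPy Cheat Sheet

Posted on 2018-05-03 Edited on 2025-03-02

Introduction

This post records some NumPy usage notes and pitfalls.

Note: unless stated otherwise, np refers to the imported numpy module.

import numpy as np

PS: I found a NumPy cheat sheet online that is well worth downloading if you are interested: Numpy CheatSheet

Converting the dtype of an ndarray

To convert an ndarray to a different data type, use its astype method: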

>>> arr = np.array([1, 2, 3, 4, 5])
>>> arr.dtype
dtype('int64')
>>> float_arr = arr.astype(np.float64)
>>> float_arr.dtype
dtype('float64')

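Note: astype returns a new array, so the result must be assigned to a variable.

Slicing does not create a new array

Slicing an array does not return a copy of the original; it returns a view onto the same data. Modifying the sliced array therefore modifies the corresponding positions of the original array.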

>>> arr = np.arange(10)
>>> arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> new_arr = arr[5:8]
>>> new_arr
array([5, 6, 7])

>>> new_arr[:] = 12
>>> new_arr
array([12, 12, 12])
>>> arr
array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

To get a slice that is an independent copy of the original array, call the copy() method.

>>> new_arr = arr[5:8].copy()
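
Whether two arrays refer to the same data can be checked with np.shares_memory. The sketch below contrasts a slice with an explicit copy.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # a view onto arr's data
copied = arr[5:8].copy()   # an independent copy

view_shares = np.shares_memory(arr, view)
copy_shares = np.shares_memory(arr, copied)

view[:] = 12               # writes through to arr, but not to the copy

print(view_shares, copy_shares)
print(arr[5:8].tolist(), copied.tolist())
```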

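Axis numbering

Many NumPy functions take an axis parameter, which for a 2-D array can be set to 0 or 1. The meaning of the two values is shown in the figure below.

In other words, axis 0 runs along the rows (down each column) and axis 1 runs along the columns (across each row). For example, when computing means of a matrix with mean(), A.mean(axis=0) takes each mean along the row direction, giving one value per column, and A.mean(axis=1) takes each mean along the column direction, giving one value per row.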

>>> A = np.random.randn(2, 3)
>>> A
array([[-0.40797393,  0.24059956, -1.57582642],
       [ 0.31626161, -0.07033558,  0.58346107]])
>>> A.mean(axis=0)
array([-0.04585616,  0.08513199, -0.49618267])
>>> A.mean(axis=1)
array([-0.58106693,  0.27646237])

A.mean(axis=0) has as many elements as A has columns, and A.mean(axis=1) has as many elements as A has rows.
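
The shapes are easier to verify with fixed numbers than with random data. The sketch below also uses keepdims=True, which keeps the reduced axis with length 1 so that the result broadcasts against the original matrix.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

col_means = A.mean(axis=0)       # one value per column
row_means = A.mean(axis=1)       # one value per row

# keepdims=True gives shape (2, 1), so each row can be centered directly
kept = A.mean(axis=1, keepdims=True)
centered = A - kept

print(col_means.tolist(), row_means.tolist())
print(kept.shape, centered.tolist())
```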

Counting the True values in a Boolean array

Boolean values convert to 0 (False) or 1 (True), so sum() counts the number of True values in a Boolean array.

>>> bools = np.array([False, True, True, False, False])
>>> bools.sum()
2
>>> np.sum(bools)
2

Since the number of True values is known, so is their proportion, which mean() computes directly.

>>> bools.mean()
0.4
>>> np.mean(bools)
0.4
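
In practice the Boolean array usually comes from a comparison. The sketch below counts and measures the fraction of elements that satisfy a condition; the sample values and the threshold are arbitrary.

```python
import numpy as np

values = np.array([1, 5, 2, 8, 3])
mask = values > 2                    # Boolean array from a comparison

n_large = np.count_nonzero(mask)     # same count as mask.sum()
fraction = mask.mean()               # share of elements above the threshold

print(n_large, fraction)
```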

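Terminal Shortcuts

Posted on 2018-05-03 Edited on 2025-03-02

Environment

OS: OS X 10.11.6

Shell: bash and zsh

Other: iTerm2

Shortcut list

Shortcut	Function
Option + Left or Right	Move the cursor backward or forward by one word
Control + F or B	Move the cursor forward or backward by one character
Control + D	Delete the character under the cursor
Control + H	Delete the character before the cursor
Control + W	Delete the word before the cursor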
Control + A	Move the cursor to the start of the line
Control + E	Move the cursor to the end of the line
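Control + U	Clear the whole line
Control + K	Delete everything after the cursor
Control + C	Kill the running program
Control + Z	Suspend the running program; resume it with `fg process_name`

Shortcut details

Option + Left or Right

With iTerm2 as the terminal emulator, this shortcut does not behave as expected out of the box and has to be configured. The steps below follow the tutorial in [4].

Step 1: Under Profiles > Keys, set the Left ⌥ key to act as Esc+

Step 2: Add a key binding for ⌥ ← with the following settings

Keyboard Shortcut: ⌥ ←
Action: Send Escape Sequence
Esc+: b

Step 3: Add a key binding for ⌥ → with the following settings

Keyboard Shortcut: ⌥ →
Action: Send Escape Sequence
Esc+: f

References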

Python Snippets Part 1

Posted on 2018-05-02 Edited on 2025-03-02

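1. Checking whether a file or directory exists
2. ParameterGrid
3. Batch-downloading images
4. Walking all files in a directory
5. Timing a function
6. Checking whether an object is iterable
7. Detecting the operating system

1. Checking whether a file or directory exists

Create a directory and a file: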

$ mkdir dir1 && touch file1.txt
$ ls -l
total 0
drwxr-xr-x  2 luowanqian  wheel  68  5  2 23:21 dir1
-rw-r--r--  1 luowanqian  wheel   0  5  2 23:21 file1.txt

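os.path.exists() tells whether a file or directory exists, but not whether the path is a file or a directory. To distinguish the two, use os.path.isfile(), which returns True if the path is an existing file and False otherwise. os.path.isfile() can also be used directly to check that a file exists. Test code: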

>>> import os
>>> os.path.exists('dir1')
True
>>> os.path.exists('file1.txt')
True
>>> os.path.isfile('dir1')
False
>>> os.path.isfile('file1.txt')
True

>>> os.path.exists('dir2')
False
>>> os.path.exists('file2.txt')
False
>>> os.path.isfile('dir2')
False
>>> os.path.isfile('file2.txt')
False

Neither the directory dir2 nor the file file2.txt exists, so both os.path.exists() and os.path.isfile() return False.
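
The same checks can be written with pathlib, which is a different API from the os.path functions used above. The sketch below recreates dir1 and file1.txt inside a temporary directory so that it can run anywhere.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dir1").mkdir()
    (root / "file1.txt").touch()

    results = {}
    for name in ("dir1", "file1.txt", "dir2"):
        p = root / name
        # (exists, is a file, is a directory)
        results[name] = (p.exists(), p.is_file(), p.is_dir())

print(results)
```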

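2. ParameterGrid

Grid search is the most common way to tune the parameters of machine learning algorithms, and it requires combining several sets of parameter values. Scikit-learn provides the ParameterGrid class to generate all parameter combinations. The generator below extracts its core logic: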

from itertools import product


def parameters_grid(parameter_map):
    items = sorted(parameter_map.items())
    if not items:
        yield {}
    else:
        keys, values = zip(*items)
        for v in product(*values):
            params = dict(zip(keys, v))
            yield params


if __name__ == '__main__':
    parameter_map = {'a': [1, 2], 'b': [True, False]}
    for params in parameters_grid(parameter_map):
        print(params)

Output:

{'a': 1, 'b': True}
{'a': 1, 'b': False}
{'a': 2, 'b': True}
{'a': 2, 'b': False}
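
The number of combinations equals the product of the lengths of the value lists, which is worth checking before launching a long search. The sketch below builds the combinations with the same itertools.product idea; the parameter names lr and depth are hypothetical.

```python
from itertools import product
from math import prod

grid = {"lr": [0.1, 0.01, 0.001], "depth": [2, 4]}  # hypothetical parameters

# Same idea as the generator above: fix a key order, then take the product
keys = sorted(grid)
combos = [dict(zip(keys, values))
          for values in product(*(grid[k] for k in keys))]

expected = prod(len(v) for v in grid.values())

print(len(combos), expected)
print(combos[0])
```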

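3. Batch-downloading images

This script uses the requests library to batch-download images. Downloads run in multiple threads to speed things up, and a failed download is retried automatically. Specific exceptions are not handled; they are simply caught and printed.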

import os
import requests
from retrying import retry
from io import BytesIO
from PIL import Image
import progressbar
import concurrent.futures as concurrent


@retry(stop_max_attempt_number=10)
def download(image_url, image_path):
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    img.save(image_path)


if __name__ == '__main__':
    base_url = 'http://www.baidu.com'
    num_images = 3
    suffix = '.jpg'
    image_dir = 'images'

    num_workers = 4

    if not os.path.isdir(image_dir):
        os.mkdir(image_dir)

    with concurrent.ThreadPoolExecutor(max_workers=num_workers) as executor:
        image_urls = []
        image_paths = []
        for image_id in range(num_images):
            url = base_url + '/' + str(image_id + 1) + suffix
            file_path = os.path.join(image_dir, str(image_id + 1) + suffix)
            image_urls.append(url)
            image_paths.append(file_path)

        tasks = {
            executor.submit(download, url, file_path):
                           (url, file_path) for url, file_path in zip(image_urls, image_paths)
        }

        i = 0
        total = len(image_urls)
        pbar = progressbar.ProgressBar(max_value=total).start()
        for task in concurrent.as_completed(tasks):
            url, file_path = tasks[task]
            try:
                task.result()
                i = i + 1
                pbar.update(i)
            except Exception as exc:
                print('{} generated an exception: {}'.format(url, exc))
        pbar.finish()

4. Walking all files in a directory

The directory structure is:

$ tree test
test
├── 1.txt
├── 2.txt
└── test2
    ├── 3.txt
    └── 4.txt

Walk the test directory with os.walk():

import os

root_dir = '/tmp/test'
for root, dirs, files in os.walk(root_dir, topdown=True):
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        print(os.path.join(root, name))

The result is:

/tmp/test/1.txt
/tmp/test/2.txt
/tmp/test/test2
/tmp/test/test2/3.txt
/tmp/test/test2/4.txt

With topdown=False, the result is:

/tmp/test/test2/3.txt
/tmp/test/test2/4.txt
/tmp/test/1.txt
/tmp/test/2.txt
/tmp/test/test2
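
One practical consequence of topdown=True is that the dirs list can be modified in place to stop os.walk() from descending into certain subdirectories. The sketch below builds a small tree in a temporary directory and skips test2; the tree only mirrors part of the example above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root_dir:
    os.makedirs(os.path.join(root_dir, "test2"))
    for rel in ("1.txt", "2.txt", os.path.join("test2", "3.txt")):
        open(os.path.join(root_dir, rel), "w").close()

    visited = []
    for root, dirs, files in os.walk(root_dir, topdown=True):
        # In-place edit of dirs: os.walk will not enter the removed entries
        dirs[:] = [d for d in dirs if d != "test2"]
        visited.extend(sorted(files))

print(visited)
```

This pruning only works with topdown=True, because with topdown=False the subdirectories have already been visited by the time their parent is reported.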

5. Timing a function

A decorator is used here to measure a function's running time:

import time
from functools import wraps


def timethis(func):
    """
    Decorator that reports the execution time
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # monotonic, high-resolution clock for intervals
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(func.__name__, end - start)

        return result

    return wrapper


@timethis
def countdown(n):
    while n > 0:
        n -= 1
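
A usage sketch follows. It is a variation on the decorator above rather than the same code: it uses time.perf_counter() and stores the elapsed time on a last_elapsed attribute (a hypothetical name) instead of printing it, which makes the measurement available to the caller.

```python
import time
from functools import wraps


def timethis(func):
    """Decorator that records the execution time of the last call."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    wrapper.last_elapsed = None  # no call has been timed yet
    return wrapper


@timethis
def countdown(n):
    while n > 0:
        n -= 1
    return n


countdown(100_000)
print(countdown.__name__, countdown.last_elapsed)
```

Thanks to @wraps, the decorated function keeps its original name and docstring.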

6. Checking whether an object is iterable

def isiterable(obj):
    try:
        iter(obj)
    except TypeError:  # iter() raises TypeError for non-iterable objects
        return False
    else:
        return True


if __name__ == "__main__":
    a = [1, 2]
    b = 3
    print(f"'a' is {'iterable' if isiterable(a) else 'not iterable'}")
    print(f"'b' is {'iterable' if isiterable(b) else 'not iterable'}")

Output:

'a' is iterable
'b' is not iterable
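
One reason to prefer the iter() test over isinstance(obj, collections.abc.Iterable) is that the ABC check only looks for __iter__, while iter() also accepts objects that implement the older __getitem__ sequence protocol. The sketch below shows the difference with a hypothetical Legacy class.

```python
from collections.abc import Iterable


class Legacy:
    """Iterable only through the old __getitem__ sequence protocol."""

    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)
        return index


def isiterable(obj):
    try:
        iter(obj)
    except TypeError:
        return False
    return True


legacy = Legacy()
abc_says = isinstance(legacy, Iterable)  # the ABC only checks for __iter__
iter_says = isiterable(legacy)           # iter() falls back to __getitem__

print(abc_says, iter_says, list(legacy))
```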

7. Detecting the operating system

sys.platform identifies the current operating system. Its values on common systems are:

System	platform value
AIX	`'aix'`
Linux	`'linux'`
Windows	`'win32'`
Windows/Cygwin	`'cygwin'`
macOS	`'darwin'`

The official documentation recommends using startswith() to test the platform (see sys.platform). A test snippet:

import sys


def identify_platform():
    platform = sys.platform
    if platform.startswith("freebsd"):
        return "freebsd"
    elif platform.startswith("linux"):
        return "linux"
    elif platform.startswith("aix"):
        return "aix"
    elif platform.startswith("win"):
        return "windows"
    elif platform.startswith("darwin"):
        return "macos"
    else:
        return "unknown"


if __name__ == "__main__":
    print(f"Platform: {identify_platform()}")

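Implementing the VGG Network

Posted on 2018-04-30 Edited on 2025-03-02

Network architecture

For details on VGG, see the paper Very Deep Convolutional Networks for Large-Scale Image Recognition. The network configuration figure is shown here.

Convolutional layers are written as conv<receptive field size>-<number of channels>, with stride 1 and padding 1. Pooling layers use a 2x2 window with stride 2. For brevity, ReLU layers are not shown in the figure. The input image size is 224x224x3; the output size after each layer is listed below (for VGG16, i.e. network D in the figure):

Layer	Output size
Input	224x224x3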
conv3-64	224x224x64
conv3-64	224x224x64
maxpool	112x112x64
conv3-128	112x112x128
conv3-128	112x112x128
maxpool	56x56x128
conv3-256	56x56x256
conv3-256	56x56x256
conv3-256	56x56x256
maxpool	28x28x256
conv3-512	28x28x512
conv3-512	28x28x512
conv3-512	28x28x512
maxpool	14x14x512
conv3-512	14x14x512
conv3-512	14x14x512
conv3-512	14x14x512
maxpool	7x7x512
FC	1x1x4096
FC	1x1x4096
FC	1x1x1000

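PyTorch implementation

A PyTorch implementation of VGG can be found in vgg.py, which implements networks A, B, D and E (VGG11, VGG13, VGG16 and VGG19) along with their Batch Normalization variants. To apply VGG to images of a different input size, the main change is the size of the final fully connected layers, i.e. the classifier attribute of the VGG class:

import torch.nn as nn


class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()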
        self.features = features
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
        if init_weights:
            self._initialize_weights()

If the input images come from the CIFAR-10 dataset (size 32x32, 10 classes), the output after the last pooling layer is 1x1x512, so the classifier attribute becomes:

self.classifier = nn.Sequential(
    nn.Linear(512 * 1 * 1, 512),
    nn.ReLU(True),
    nn.Dropout(),
    nn.Linear(512, 512),
    nn.ReLU(True),
    nn.Dropout(),
    nn.Linear(512, num_classes),
)

Also set num_classes=10 when calling the function that builds the chosen network variant.
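
The 7x7 and 1x1 sizes follow from simple arithmetic: the 3x3 convolutions with stride 1 and padding 1 keep the spatial size, and each of the five 2x2 pooling layers with stride 2 halves it. The sketch below is plain Python, not PyTorch; the function name vgg_feature_size is hypothetical.

```python
def vgg_feature_size(input_size, num_pools=5):
    """Spatial size after the pooling layers (the convolutions keep the size)."""
    size = input_size
    for _ in range(num_pools):
        size //= 2  # 2x2 max pooling with stride 2 halves the size
    return size


imagenet_side = vgg_feature_size(224)
cifar_side = vgg_feature_size(32)

# Input width of the first fully connected layer: channels * height * width
imagenet_features = 512 * imagenet_side * imagenet_side
cifar_features = 512 * cifar_side * cifar_side

print(imagenet_side, imagenet_features)
print(cifar_side, cifar_features)
```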

References

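Enabling Parallel HDF5 Support in h5py

Posted on 2018-04-28 Edited on 2025-03-02

Environment

OS: OS X 10.11.6

Python version: 3.6.5

Installing HDF5

Install HDF5 with Homebrew, making sure the mpi build option is enabled:

$ brew install hdf5 --with-mpi

Installing mpi4py

The official tutorial uses the mpi4py package when working with Parallel HDF5. It can be installed directly with pip:

$ pip3 install mpi4py

Note: one issue I ran into is that if the package was installed before and is then uninstalled and reinstalled, pip does not recompile it. The --no-cache-dir option is needed in that case.

$ pip3 install --no-cache-dir mpi4py

Installing h5py

Installing h5py directly with pip does not enable Parallel HDF5; some build options are required. Following the official installation guide, run:

$ export CC=mpicc
$ export HDF5_MPI="ON"
$ pip3 install --no-binary=h5py h5py

As noted above, to recompile the package, add the --no-cache-dir option:

$ pip3 install --no-cache-dir --no-binary=h5py h5py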

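Converting Image Data to an LMDB File

Posted on 2018-04-28 Edited on 2025-03-02

Introduction

When using Caffe, a common data source is an LMDB database. The data at hand is usually a pile of images, which must first be written into LMDB. Caffe ships its own conversion tool, but it is a C++ program that has to be compiled. Since this post uses Python, the package from [3] is used for the conversion instead.

Note: the code and image data are on GitHub.

Reading and writing LMDB

Data description

There are 10 images in the data directory, with the following file names: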

$ ls data | sort -n
1.png
2.png
3.png
4.png
5.png
6.png
7.png
8.png
9.png
10.png

Each image has a class label, stored in the file labels.csv:

$ cat labels.csv
id,label
1,frog
2,truck
3,truck
4,deer
5,automobile
6,automobile
7,bird
8,horse
9,ship
10,cat

The first column is the image file name (without extension) and the second column is the image class.

Writing to LMDB

import numpy as np
import pandas as pd
from skimage import io
import matplotlib.pyplot as plt
import os
import lmdb
import caffe


def make_datum(image, label, channels, height, width):
    datum = caffe.proto.caffe_pb2.Datum()
    datum.channels = channels
    datum.label = int(label)
    datum.height = height
    datum.width = width
    datum.data = image.tobytes()

    return datum


# data path and lmdb path
dataset_path = './data'
label_file = 'labels.csv'
lmdb_path = 'cifar10_lmdb'

# labels mapping
labels_mapping = {'airplane': 0, 'automobile': 1,
                  'bird': 2, 'cat': 3, 'deer': 4, 'dog': 5,
                  'frog': 6, 'horse': 7, 'ship': 8, 'truck': 9}
classes = {}
for key in labels_mapping:
    classes[labels_mapping[key]] = key

# load data
df = pd.read_csv(label_file)
df['label'] = df['label'].map(labels_mapping)
images = list(df.id)
labels = list(df.label)

# write data to LMDB
map_size = int(1e6)  # maximum database size in bytes; lmdb expects an integer
batch_size = 4

count = 0
lmdb_env = lmdb.open(lmdb_path, map_size=map_size)
lmdb_txn = lmdb_env.begin(write=True)

for image_id, label in zip(images, labels):
    count = count + 1
    image_file = os.path.join(dataset_path, str(image_id) + '.png')
    image = io.imread(image_file)
    height, width, channels = image.shape
    datum = make_datum(image, label, channels, height, width)
    str_id = '{:08}'.format(count)
    lmdb_txn.put(str_id.encode('ascii'), datum.SerializeToString())  # keys must be bytes in Python 3

    if count % batch_size == 0:
        lmdb_txn.commit()
        lmdb_txn = lmdb_env.begin(write=True)

lmdb_txn.commit()
lmdb_env.close()
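
The zero-padded key format '{:08}' matters because LMDB orders keys by their bytes by default, so a cursor returns records in lexicographic order. The sketch below uses plain Python sorting to illustrate that ordering without needing lmdb installed.

```python
ids = [1, 2, 10]

# Zero-padded keys, as in the write loop above
padded = ['{:08}'.format(i).encode('ascii') for i in ids]

# Unpadded keys for comparison
unpadded = [str(i).encode('ascii') for i in ids]

# Byte-wise sorting stands in for the order in which a cursor yields keys
padded_order = sorted(padded)
unpadded_order = sorted(unpadded)

print(padded_order)
print(unpadded_order)
```

With padding, the byte order matches the numeric order; without it, b'10' sorts before b'2'.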

Reading from LMDB

lmdb_env = lmdb.open(lmdb_path)
lmdb_txn = lmdb_env.begin()
lmdb_cursor = lmdb_txn.cursor()
datum = caffe.proto.caffe_pb2.Datum()

count = 0
for key, value in lmdb_cursor:
    datum.ParseFromString(value)
    label = datum.label
    height = datum.height
    width = datum.width
    channels = datum.channels
    data = datum.data
    count = count + 1

    if count == 2:
        image = np.frombuffer(data, dtype=np.uint8)
        image = np.reshape(image, (height, width, channels))
        print('Label: {}, Class: {}'.format(label, classes[label]))
        plt.imshow(image)
        break

print('Number of items: {}'.format(count))

lmdb_env.close()

References

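h5py Cheat Sheet

Posted on 2018-04-27 Edited on 2025-03-02

This post records some h5py usage notes.

Note: unless stated otherwise, np refers to the imported numpy module.

import numpy as np

Storing floats in single precision

When storing floating-point numbers in an HDF5 file, you can choose single or double precision. Single precision is the common choice for storage: it takes half the space of double precision, so the stored data is half the size. In memory, by contrast, double precision is needed to preserve computational accuracy. A common pattern is therefore to work in double precision in memory and store to HDF5 in single precision; switching between the two only requires a dtype conversion.

In memory, create a NumPy array of double-precision floats:

>>> import numpy as np
>>> import h5py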
>>> bigdata = np.ones((100, 1000))
>>> bigdata.dtype
dtype('float64')
>>> bigdata.shape
(100, 1000)

Assigning the array directly stores it in the HDF5 file as double-precision floats:

>>> with h5py.File('big1.hdf5', 'w') as f1:
...     f1['big'] = bigdata
>>> f1 = h5py.File('big1.hdf5')
>>> f1['big'].dtype
dtype('float64')

The file size is 783K:

$ ls -lh big1.hdf5
-rw-r--r-- 1 luowanqian staff 783K 4 29 10:44 big1.hdf5

The create_dataset function can be used to store the data as single-precision floats instead:

>>> with h5py.File('big2.hdf5', 'w') as f2:
...     f2.create_dataset('big', data=bigdata, dtype=np.float32)
>>> f2 = h5py.File('big2.hdf5')
>>> f2['big'].dtype
dtype('float32')

The file size is 393K:

$ ls -lh big2.hdf5
-rw-r--r-- 1 luowanqian staff 393K 4 29 11:02 big2.hdf5
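
The two file sizes can be checked against the raw array sizes with NumPy alone. The sketch below computes the in-memory size of the same 100x1000 array in both dtypes; the small difference from the sizes reported by ls is presumably file overhead, which is an assumption rather than something measured here.

```python
import numpy as np

bigdata = np.ones((100, 1000))           # float64 by default
small = bigdata.astype(np.float32)

double_bytes = bigdata.nbytes            # 100 * 1000 * 8
single_bytes = small.nbytes              # 100 * 1000 * 4

print(double_bytes / 1024, single_bytes / 1024)  # sizes in KiB

# Single precision keeps roughly 7 significant decimal digits
rounding_error = abs(float(np.float32(0.1)) - 0.1)
print(rounding_error)
```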

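Converting dtype while reading

Suppose an HDF5 file stores single-precision floats and we want double precision once the data is in memory. The HDF5 file is set up as follows: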

>>> import numpy as np
>>> import h5py
>>> bigdata = np.ones((100, 1000))
>>> with h5py.File('big.hdf5', 'w') as f:
...     f.create_dataset('big', data=bigdata, dtype=np.float32)
>>> f = h5py.File('big.hdf5')
>>> dset = f['big']
>>> dset.dtype
dtype('float32')
>>> dset.shape
(100, 1000)

Option 1

Create an empty array with np.empty, then call read_direct:

>>> out = np.empty((100, 1000), dtype=np.float64)
>>> dset.read_direct(out)
>>> out.dtype
dtype('float64')

To read only part of the data, set the source_sel and dest_sel parameters of read_direct. For example, to read dset[0, :] into out[50, :]:

>>> out = np.empty((100, 1000), dtype=np.float64)
>>> dset.read_direct(out, source_sel=np.s_[0, :], dest_sel=np.s_[50, :])

Here np.s_ builds NumPy slice objects that carry the index information. If dest_sel is omitted, the assignment follows NumPy-like broadcasting rules:

>>> out = np.empty((100, 50), dtype=np.float32)
>>> dset.read_direct(out, source_sel=np.s_[:, 0:50])

Advantages

When reading same-sized data many times, read_direct can save time, because the buffer is allocated only once and each read overwrites it in place. A benchmark:

import h5py
import numpy as np
from timeit import timeit


filename = 'test.hdf5'
n = 10000

f = h5py.File(filename, 'w')
dset = f.create_dataset('perftest', (n, n), dtype=np.float32)
dset[:] = np.random.random(n)
out = np.empty((n, 500), dtype=np.float32)


def time_simple():
    dset[:, 0:500].mean(axis=1)


def time_direct():
    dset.read_direct(out, np.s_[:, 0:500])
    out.mean(axis=1)


print('Time simple: {}'.format(timeit(time_simple, number=500)))
print('Time direct: {}'.format(timeit(time_direct, number=500)))

f.close()

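Output:

Time simple: 41.33699381200131
Time direct: 39.005692337988876

Option 2

Use Dataset.astype as a context manager. (In h5py 3.x the context-manager form is deprecated; dset.astype('float64')[...] is the direct equivalent.)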

>>> with dset.astype('float64'):
...     out = dset[...]
>>> out.shape
(100, 1000)
>>> out.dtype
dtype('float64')

It also works for reading part of the data:

>>> with dset.astype('float64'):
...     out = dset[0, :]
>>> out.dtype
dtype('float64')

Non-Maximum Suppression

Posted on 2018-04-25 Edited on 2025-03-02

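Introduction

Non-Maximum Suppression (NMS) is widely used in computer vision, mainly to remove redundant detection boxes. In face detection, for example, several boxes may be detected for one face, and the redundant ones must be removed so that only the best remains. Likewise, object detection algorithms often produce multiple boxes for one object, which also need to be pruned.

How it works

The example from reference [1] illustrates the algorithm. Suppose an image contains a car.

When locating the car, the object detector produces a pile of candidate boxes, and the redundant ones must be removed. Suppose the detector found 6 boxes and also provides, for each box, the probability or score that its content is a car (in RCNN, an SVM computes the class score of each box). NMS first sorts the boxes by score; suppose the order from lowest to highest is A, B, C, D, E, F.

1. Starting from the highest-scoring box F, check whether the overlap (IoU) of each of A~E with F exceeds a set threshold.
2. Suppose B and D overlap F by more than the threshold: discard B and D, and mark F as a box to keep.
3. With B and D removed in step 2, boxes A, C and E remain. Pick the highest-scoring one, E, and mark it as a box to keep. Then check the overlap of E with A and C: a box whose overlap exceeds the threshold is discarded, and the others go on to the next round.
4. Repeat step 3 until no boxes remain to be screened.

Implementation

Single-class NMS implementations are readily available online. Below is the Python implementation py_cpu_nms.py by Ross Girshick (the author of RCNN), with my own annotations added.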

import numpy as np


def py_cpu_nms(dets, thresh):
    """Pure Python NMS baseline."""
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    scores = dets[:, 4]

    # Area of each bounding box
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    # Sort by score in descending order
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        # Keep the highest-scoring box among those remaining
        i = order[0]
        keep.append(i)

        # Corners of the intersection: top-left and bottom-right
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        # Width and height of the intersection (0 if the boxes do not overlap)
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)

        # IoU = intersection area / (area of box 1 + area of box 2 - intersection area)
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)

        # Keep the boxes whose IoU does not exceed the threshold
        inds = np.where(ovr <= thresh)[0]

        # ovr is one element shorter than order (it skips order[0]), so shift the indices by one
        order = order[inds + 1]

    return keep
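
A small worked example may help. The sketch below is a compact re-implementation of the same greedy procedure, not Ross Girshick's code, written so that it runs on its own. The three boxes and the threshold of 0.5 are made up: the first two overlap heavily and the third is far away.

```python
import numpy as np


def iou(box, others):
    """IoU of one box against an array of boxes (same +1 pixel convention)."""
    xx1 = np.maximum(box[0], others[:, 0])
    yy1 = np.maximum(box[1], others[:, 1])
    xx2 = np.minimum(box[2], others[:, 2])
    yy2 = np.minimum(box[3], others[:, 3])
    inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
    area = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
    areas = (others[:, 2] - others[:, 0] + 1) * (others[:, 3] - others[:, 1] + 1)
    return inter / (area + areas - inter)


def nms(dets, thresh):
    """Greedy NMS over rows of [x1, y1, x2, y2, score]."""
    order = dets[:, 4].argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        overlaps = iou(dets[i, :4], dets[rest, :4])
        order = rest[overlaps <= thresh]  # drop boxes that overlap too much
    return keep


dets = np.array([
    [10, 10, 50, 50, 0.9],
    [12, 12, 52, 52, 0.8],      # overlaps the first box heavily
    [100, 100, 140, 140, 0.7],  # far from the other two
])

pair_iou = float(iou(dets[0, :4], dets[1:2, :4])[0])
keep = nms(dets, thresh=0.5)
print(round(pair_iou, 3), keep)
```

The first two boxes have an IoU of 1521 / 1841, about 0.826, so the lower-scoring one is suppressed and the boxes at indices 0 and 2 are kept.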

I wrote a simple test script for this code, available on GitHub. The result on one test image is shown below:

References

非极大值抑制（Non-Maximum Suppression，NMS）