袋熊的树洞

LeetCode 88 - Merge Sorted Array

Posted on 2018-03-10 Edited on 2025-03-02

问题描述：

Given two sorted integer arrays nums1 and nums2, merge nums2 into nums1 as one sorted array.

Note:
You may assume that nums1 has enough space (size that is greater or equal to m + n) to hold additional elements from nums2. The number of elements initialized in nums1 and nums2 are m and n respectively.

题目中说数组nums1大小为 m，数组nums2的大小为 n，要将数组nums2中的元素全部放到数组nums1中且保证数组nums1有序（这里有序假设是从小到大）。

时间复杂度为 O(nm) 的解法

初看到这个题目，一个最直观的想法就是设定两个指针分别指向两个数组开始的元素（令指针p1指向数组nums1，p2指向数组p2），然后从头开始遍历两个数组。

如果指针p1指向的元素小于等于指针p2指向的元素，指针p1加一。
如果指针p1指向的元素大于指针p2指向的元素，将指针p2指向的元素插入到指针p1指向的位置，指针p1指向的位置后面的元素都后移一位，指针p1和指针p2都加一。

这样操作直到数组nums1或数组nums2的元素遍历完，如果是数组nums1的元素先被遍历完，则将数组nums2剩下的元素都复制到数组nums1中，如果是数组nums2的元素先被遍历完，算法停止。在最坏的情况下，数组nums2的元素都小于数组nums1的元素，每次插入数组nums2的元素，都要移动数组nums1的 $O(m)$ 个字符，那么算法复杂度为 $O(nm)$。

时间复杂度为 O(n + m) 的解法

换个思路，我们不从头开始遍历数组，而是从数组后面开始遍历数组。这里我们准备三个指针：p、p1和p2，指针p指向数组nums1的第 n + m 个元素，指针p1指向数组nums1最后一个元素，指针p2指向数组nums2的最后一个元素。

如果指针p1指向的元素小于指针p2指向的元素，将指针p2指向的元素复制到指针p指向的位置，指针p和p2减一。
如果指针p1指向的元素大于等于指针p2指向的元素，将指针p1指向的元素复制到指针p指向的位置，指针p和p1减一。

这样操作直到数组nums1或数组nums2被遍历完，如果是数组nums1先被遍历完，将数组nums2剩余的元素复制到数组nums1中，如果是数组nums2先被遍历完，则算法停止。从上面分析我们可以看出，只需要遍历一次数组nums1和数组nums2就行，不需要移动元素，因此这个算法的时间复杂度为 $O(n + m)$。

#include <iostream>
#include <vector>
using namespace std;

class Solution {
public:
    void merge(vector<int>& nums1, int m, vector<int>& nums2, int n) {

        if (m <= 0)
            nums1 = nums2;
        if (n <= 0)
            return;

        int p, p1, p2;
        p1 = m - 1;
        p2 = n - 1;
        p = m + n - 1;

        while (p >= 0 && p1 >= 0 && p2 >= 0) {
            if (nums1[p1] >= nums2[p2]) {
                nums1[p] = nums1[p1];
                p1--;
            } else if (nums1[p1] < nums2[p2]) {
                nums1[p] = nums2[p2];
                p2--;
            }
            p--;
        }

        if (p2 >= 0) {
            while (p >=0 && p2 >=0)
                nums1[p--] = nums2[p2--];
        }
    }
};

int main()
{
    Solution solu;

    vector<int> nums1 = {1, 3, 5, 7, 0, 0, 0};
    vector<int> nums2 = {2, 4, 6};
    solu.merge(nums1, 4, nums2, nums2.size());

    for (auto it=nums1.begin(); it!=nums1.end(); ++it)
        cout << *it << " ";
    cout << endl;

    vector<int> nums3 = {0, 0, 3, 0, 0, 0, 0, 0, 0};
    vector<int> nums4 = {-1, 1, 1, 1, 2, 3};
    solu.merge(nums3, 3, nums4, nums4.size());

    for (auto it=nums3.begin(); it!=nums3.end(); ++it)
        cout << *it << " ";
    cout << endl;

    return 0;
}

LeetCode 215 - Kth Largest Element in an Array

Posted on 2018-03-09 Edited on 2025-03-02

问题描述：

Find the kth largest element in an unsorted array. Note that it is the kth largest element in the sorted order, not the kth distinct element.
For example,
Given [3,2,1,5,6,4] and k = 2, return 5.

Note:
You may assume k is always valid, 1 ≤ k ≤ array’s length.

这道题目最简单的思路莫过于把输入的 $n$ 个数进行排序，排序之后第 $k$ 个数字 (如果数是从大到小排序) 就所要求的第 $k$ 大的数字，这种思路的时间复杂度是 $O(nlog(n))$ 。但是这样还不是最快的做法，接下来讨论更快的算法。

时间复杂度为 O(nlog(k)) 的算法

我们可以先创建一个大小为 $k$ 的数据容器来存储最大的 $k$ 个数，接下来每次从输入的 $n$ 个数中读入一个数。如果容器中已有的数字少于 $k$ 个，则直接把这次读入的数放入容器之中；如果容器中已有 $k$ 个数字了，也就是容器已经满了，此时只能替换容器中的数字了。找出容器中 $k$ 个数的最小值，然后拿这次待插入的数与其比较，如果待插入的数字比容器中的最小值要大，则用这个数替换当前已有的最小值，如果待插入的数字比这个最小值要小，那么这个数不能是最大的 $k$ 个数之一，则抛弃这个数字。

因此，当容器满了以后，我们要做三件事：一是在 $k$ 个数中找到最小数；二是有可能在这个容器中删除最小数；三是有可能要插入一个新的数字。如果使用一棵二叉树来实现这个容器，那么我们能在 $O(log(k))$ 的时间内实现这三步操作。因此，对于 $n$ 个输入数字而言，总的时间效率是 $O(n log(k))$。

由于每次都需要找到 $k$ 个数中的最小值，很自然可以想到用最小堆。当然，我们还可以考虑是用红黑树来实现这个容器，红黑树可以保证查找、删除和插入这些操作都只需要 $O(log(k))$ 时间，在STL中，set 和 multiset 都是基于红黑树实现。

基于最小堆实现


#include <queue>
#include <iostream>
using namespace std;

class Solution {
public:
    int findKthLargest(vector<int>& nums, int k) {
        priority_queue<int, vector<int>, greater<int>> k_numbers;
        int min_number;

        for (auto it=nums.begin(); it!=nums.end(); ++it) {
            if (k_numbers.size() < k) {
                k_numbers.push(*it);
            } else {
                min_number = k_numbers.top();
                if (*it > min_number) {
                    k_numbers.pop();
                    k_numbers.push(*it);
                }
            }
        }

        return k_numbers.top();
    }
};

int main()
{
    Solution solu;
    vector<int> nums = {3, 2, 1, 5, 6, 4};

    cout << "The 2th max number is " << solu.findKthLargest(nums, 2) << endl;

    return 0;
}

基于红黑树实现


#include <vector>
#include <set>
#include <iostream>
using namespace std;

class Solution {
public:
    int findKthLargest(vector<int>& nums, int k) {
        multiset<int> k_numbers;
        multiset<int>::iterator set_iterator;

        for (auto it=nums.begin(); it!=nums.end(); ++it) {
            if (k_numbers.size() < k) {
                k_numbers.insert(*it);
            } else {
                set_iterator = k_numbers.begin();
                if (*it > *set_iterator) {
                    k_numbers.erase(set_iterator);
                    k_numbers.insert(*it);
                }
            }
        }

        return *(k_numbers.begin());
    }
};

int main()
{
    Solution solu;
    vector<int> nums = {3, 2, 1, 5, 6, 4};

    cout << "The 2th max number is " << solu.findKthLargest(nums, 2) << endl;

    return 0;
}

剑指offer - 变态跳台阶

Posted on 2018-03-08 Edited on 2025-03-02

题目描述：一只青蛙一次可以跳上1级台阶，也可以跳上2级 …… 它也可以跳上n级。求该青蛙跳上一个n级的台阶总共有多少种跳法。

先从最简单的情况说起。如果只有1级台阶，跳法只有一种。如果有2级台阶，那就有两种跳法：一种是分两次跳，每次跳1级；另一种就是一次跳2级。如果有3级台阶，那就有四种跳法：一种是分三次跳，每次跳1级；一种是先跳2级，再跳1级；一种是先跳1级，再跳2级；还有一种是一次跳3级。

接下来讨论一般情况。我们把n级台阶时的跳法看成n的函数，记为$f(n)$。根据前面分析，可以得到
$$
\begin{aligned}
f(1) & = 1 = 2^0 \
f(2) & = 2 = 2^{2-1} \
f(3) & = 4 = 2^{3-1}
\end{aligned}
$$

可以看出一个规律，那就是$f(n) = 2^{n-1}$，那么怎么证明呢？先分析下n级台阶怎么跳，青蛙第一次跳有n中选择，可以第一次跳1级，此时后面台阶的跳法数目等于$f(n-1)$，如果第一次跳2级，此时后面台阶的跳法数目等于$f(n-2)$，一直到第一次直接跳$n$级，此时剩余台阶数为0，则跳法数目为$f(0)$，这里约定$f(0)=1$，为了后面分析方便。综上可以得到一个递推公式：
$$
f(n) = f(n-1) + f(n-2) + \cdots + f(2) + f(1) + f(0)
$$
接下来用数学归纳法证明$f(n)=2^{n-1}$。如果前$n$项满足公式$f(n)=2^{n-1}$，那么对于$f(n+1)$，可以得到
$$
\begin{align}
f(n+1) &= f(n) + f(n-1) + \cdots + f(1) + f(0) \
& = 2^{n-1} + 2^{n-2} + \cdots + 2^0 + 1 \
& = \frac{1 - 2^n}{1 - 2} + 1 \
& = 2^n
\end{align}
$$
可看出$f(n+1)$仍然满足公式。知道通向公式后，剩下的实现就很简单了，贴上C++代码。

#include <iostream>
using namespace std;

class Solution {
public:
    int jumpFloorII(int number) {
        if (number == 0)
            return 0;

        int f = 1;
        for (int i=1; i<number; i++)
            f *= 2;

        return f;
    }
};

int main()
{
    Solution solu;

    cout << "f(0)= " << solu.jumpFloorII(0) << endl;
    cout << "f(1)= " << solu.jumpFloorII(1) << endl;
    cout << "f(2)= " << solu.jumpFloorII(2) << endl;
    cout << "f(5)= " << solu.jumpFloorII(5) << endl;
    return 0;
}

大量图片文件生成SequenceFile

Posted on 2017-03-12 Edited on 2025-03-02

本文主要介绍如何将大量图片数据集合起来，然后生成一个SequenceFile，这么做的原因是避免Hadoop处理大量小文件导致性能下降。SequenceFile的key-value内容以及数据类型为：

1 2	key : 储存图片文件名，数据类型为Text value : 储存图片数据，数据类型为ArrayPrimitiveWritable

为什么要用ArrayPrimitiveWritable来存储图片数据，主要是我想将图片数据用一个int数组来存储，数组元素就是图像的像素，数据范围在[0, 255]，当然，你可以使用byte数组来存储图片数据。

Java获取指定目录下的所有文件

Posted on 2017-03-11 Edited on 2025-03-02

本文介绍了如何使用Java来获取指定目录下的所有文件。

获取指定目录下的所有文件

使用递归方法遍历目录，如果访问的是文件夹，则进入递归，否则将文件加入到文件列表中，同时可以设定遍历深度（从深度1开始）。

import java.io.File;
import java.util.ArrayList;

public class ListFiles {

    /**
     * 获取指定目录下的所有文件（不包括文件夹)
     * @param path 文件夹的File对象或者文件夹路径字符串
     * @param depth 遍历深度，从深度1开始
     * @return 指定目录下所有的文件的File对象的集合
     */
    public static ArrayList<File> getAllFiles(Object path, int depth) {
        File directory = null;
        if (path instanceof File) {
            directory = (File)path;
        } else {
            directory = new File(path.toString());
        }

        ArrayList<File> files = new ArrayList<File>();
        if (directory.isFile()) {
            files.add(directory);
        } else if (directory.isDirectory()) {
            depth--;
            if (depth >= 0) {
                File[] fileList = directory.listFiles();
                for (File file : fileList) {
                    files.addAll(getAllFiles(file, depth));
                }
            }
        }

        return files;
    }
}

测试代码如下：

import org.junit.Test;

import java.io.File;
import java.util.ArrayList;

public class ListFilesTest {
    @Test
    public void getAllFiles() throws Exception {
        String path = "/Users/luowanqian/Documents/Datasets/LFWCrop/lfwcrop_grey/faces";

        ArrayList<File> files = ListFiles.getAllFiles(path, 1);
        int n = 0;
        for (File file : files) {
            n++;
            System.out.println(file.toString());
        }
        System.out.println("Total : " + n);
    }

}

运行结果如下：

...
/Users/luowanqian/Documents/Datasets/LFWCrop/lfwcrop_grey/faces/Zoran_Djindjic_0004.pgm
/Users/luowanqian/Documents/Datasets/LFWCrop/lfwcrop_grey/faces/Zorica_Radovic_0001.pgm
/Users/luowanqian/Documents/Datasets/LFWCrop/lfwcrop_grey/faces/Zulfiqar_Ahmed_0001.pgm
/Users/luowanqian/Documents/Datasets/LFWCrop/lfwcrop_grey/faces/Zumrati_Juma_0001.pgm
/Users/luowanqian/Documents/Datasets/LFWCrop/lfwcrop_grey/faces/Zurab_Tsereteli_0001.pgm
/Users/luowanqian/Documents/Datasets/LFWCrop/lfwcrop_grey/faces/Zydrunas_Ilgauskas_0001.pgm
Total : 13233

Reference

Java读取PGM图片

Posted on 2017-03-11 Edited on 2025-03-02

最近下载了一个人脸数据库，里面的图片都是pgm格式，想提取每张图片的所有像素的像素值，初步设想是使用Java SE中读取图片。在使用Java SE中类来读取图片时，通常使用ImageIO包中的类来读取图片，但是有个缺点就是默认支持的图片格式很少，Java 8中javax.imageio支持的图片格式有：

从图中看出库仅仅支持JPEG、PNG、BMP、WBMP以及GIF这些图片格式。如果想读取其他格式（例如PGM）时应该怎么办？网上搜索到了一个Java ImageIO的扩展插件——TwelveMonkeys，安装后ImageIO就可以支持多种图片格式，这个有一个好处就是原先读取图片的代码不需要改变，读取图片还是使用 javax.imageio 包。

安装TwelveMonkeys

安装很简单，使用Maven可以很容易地添加到项目中，如果想让ImageIO支持JPEG和TIFF格式，可以在POM文件中添加下面的代码：

...
<dependencies>
    ...
    <dependency>
        <groupId>com.twelvemonkeys.imageio</groupId>
        <artifactId>imageio-jpeg</artifactId>
        <version>3.3.1</version> <!-- Alternatively, build your own version -->
    </dependency>
    <dependency>
        <groupId>com.twelvemonkeys.imageio</groupId>
        <artifactId>imageio-tiff</artifactId>
        <version>3.3.1</version> <!-- Alternatively, build your own version -->
    </dependency>
</dependencies>

如果要支持PGM格式，可以添加下面代码：

<dependency>
    <groupId>com.twelvemonkeys.imageio</groupId>
    <artifactId>imageio-pnm</artifactId>
    <version>3.3.1</version>
</dependency>

想要支持什么格式唯一要变的地方就是artifactId中内容，基本模版就是imageio-xxx。

读取PGM图片

读取一张PGM图片，然后将图片的所有像素的像素值输出到一个文本文件，实现代码如下：

灰度图片

package Image;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.awt.image.Raster;
import java.io.*;

public class PGM {
    public static void main(String[] args) throws IOException {
        // Grayscale image
        String path = "/Users/luowanqian/Downloads/Zach_Pillar_0001.pgm";
        String output = "/Users/luowanqian/Downloads/output.txt";

        BufferedImage image = ImageIO.read(new File(path));
        Raster source = image.getRaster();
        int width = image.getWidth();
        int height = image.getHeight();

        int[] pixels = new int[width*height];
        source.getPixels(0, 0, width, height, pixels);

        BufferedWriter out = new BufferedWriter(new FileWriter(output));
        for (int i=0; i<width*height; i++)
            out.write(pixels[i] + " ");
        out.close();
    }
}

人脸数据库Yale

Posted on 2017-03-09 Edited on 2025-03-02

Yale人脸数据库在人脸识别论文中频繁用到，因此做人脸识别这方面研究还是存一份这个数据库以备用。在网上可以很容易下载这个数据库，我在浙大蔡登教授的主页上下载了这个数据库（Four face databases in matlab format)，主页上这个数据库有两种大小可以选择下载，一个是图像大小是32x32的数据库，另一个是64x64，本文选择的是32x32。

数据集说明

主页上有对这个数据集说明：

Contains 165 grayscale images in GIF format of 15 individuals. There are 11 images per subject, one per different facial expression or configuration: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink.

显示人脸图像

主页上提供显示数据库中人脸图像的代码，整理如下：

% load data
load('Yale_32x32.mat');

faceW = 32;
faceH = 32;
numPerLine = 11;
ShowLine = 2;

Y = zeros(faceH*ShowLine, faceW*numPerLine);
for i=0:ShowLine-1
    for j=0:numPerLine-1
        Y(i*faceH+1:(i+1)*faceH, j*faceW+1:(j+1)*faceW) = reshape(fea(i*numPerLine+j+1,:), [faceH, faceW]);
    end
end

imagesc(Y);
colormap(gray);

pause();

导出数据到文本文件

为了后续使用，我将人脸数据导出到文本文件中，文本每行包含了一张人脸数据以及该人脸对应的类标，写了一个简短的代码：

clear all; close all; clc;

% Load data
% Contains variables 'fea' and 'gnd'
% Each row of 'fea' is a face; 'gnd' is the label
load('Yale_32x32.mat');

output = 'data.txt';
[row, col] = size(fea);

file = fopen(output, 'w');
for i=1:row
    for j=1:col
        fprintf(file, '%d ', fea(i,j));
    end
    fprintf(file, '%d\n', gnd(i));
end

fclose(file);

代码运行后生成一个文本文件 __data.txt__。

Hadoop Streaming使用

Posted on 2017-03-07 Edited on 2025-03-02

关于 Hadoop Streaming 的使用，可以参考官方网站 Hadoop Streaming。个人觉得使用Steaming这个功能，可以使用其他语言来编写MapReduce程序，相比使用Java来编写程序，工作量小了很多。

根据《Hadoop权威指南》中例子使用python来实现 Max Temperature 这个程序，要分别实现Map函数以及Reduce函数。

Map函数

#!/usr/bin/env python

import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)

Reduce函数

#!/usr/bin/env python

import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)

本文用到的文件如下：

其中 1901.txt为数据文件，max_temperature_map.py定义了Map函数，max_temperature_reduce.py定义了Reduce函数

定义好Map函数以及Reduce函数后，利用unix的管道将Map过程以及Reduce过程连接起来，使用下面命令

1	$ cat 1901.txt \| ./max_temperature_map.py \| sort \| ./max_temperature_reduce.py

运行结果如下：

运行结果正确，此时就可以使用Hadoop的Streaming了，使用下面命令

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -fs file:///  \
    -input 1901.txt \
    -output output \
    -mapper max_temperature_map.py \
    -reducer max_temperature_reduce.py

注意：如果文件是在本地文件系统中而不是HDFS中时，要使用 -fs file:/// 这个命令选项

运行结果保存到 output 文件夹中

Hadoop的FileNotFound问题

Posted on 2017-02-28 Edited on 2025-03-02

在实验《Hadoop权威指南第3版》书本上的例子4-2时发现了一个关于 FileSystem 的问题。例子代码如下：

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        String outputUri =
                CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}

编译好按照使用书本上的命令行运行程序

1	% hadoop FileDecompressor file.gz

运行时抛出了 java.io.FileNotFoundException 异常

错误说找不到文件 __/user/luowanqian/file.gz__，但是我的 file.gz 并不在这个目录里面，而是在 __/home/luowanqian/Downloads/Temp/MaxTemperature__，如果使用绝对路径作为参数传递给程序也会报错，最终在绝对路径前面加上 file:// 后程序就没有报错了。

初步分析了这个问题，个人认为是程序默认在HDFS中寻找 __file.gz__，而 file.gz 是在本地文件系统中，因此程序要识别这个文件是在本地文件系统中，需要在路径前面添加 __file://__，为了进一步验证我的猜想，我修改了原来的代码。

1	FileSystem fs = FileSystem.get(URI.create(uri), conf);

修改成

1	FileSystem fs = FileSystem.getLocal(conf);

再次使用下面命令运行程序

1	% hadoop FileDecompressor file.gz

这个时候程序就正常运行了

至于程序如何判断文件是在怎么样的文件系统目前还不知道怎么解决，初学Hadoop还是很多地方不清楚。

Hadoop的Connection Refused问题

Posted on 2017-02-24 Edited on 2025-03-02

最近在学习Hadoop时候遇到了一个问题，就是操作Hadoop文件系统时遇到 Connection Refused 问题，当时问题显示如下：

遇到这问题时，首先查看了错误信息提示中提到的网址 https://wiki.apache.org/hadoop/ConnectionRefused，文章中提示说是端口问题，但是经过排查不是端口问题，所以只能排查其他原因了。首先使用Java的 jps 命令查看JVM，查看结果如下：

发现没有NameNode，然后查看了Hadoop关于NameNode日志（日志文件默认在Hadoop安装目录中的 logs 目录下），发现了一些东西：

日志说 /tmp/hadoop-luowanqian/dfs/name 目录不存在，这个目录是HDFS文件系统的镜像文件保存的地方，HDFS文件系统格式化后其镜像文件默认保存在 tmp 目录中，而这个目录中的文件在重启系统后会自动被删除，因此访问HDFS文件系统会失败，重新格式化HDFS文件系统后再访问文件系统正常了。