字典树

发表于 2020-08-05 | 更新于 2020-08-05 | 分类于算法，数据结构，手撕代码

1 字典树（Trie）

2 实现方法

class Trie:

    def __init__(self):
        """
        Initialize your data structure here.
        """
        self.root = {}
        # 特殊标记
        self.word_end = -1



    def insert(self, word: str) -> None:
        """
        Inserts a word into the trie.
        """
        cur_node = self.root
        for char in word:
            if not char in cur_node:
                cur_node[char] = {}
            cur_node = cur_node[char]
        cur_node[self.word_end] = True
                

    def search(self, word: str) -> bool:
        """
        Returns if the word is in the trie.
        """
        cur_node = self.root
        for char in word:
            if char not in cur_node:
                return False
            cur_node = cur_node[char]
        
        if self.word_end not in cur_node:
            return False
        return True



    def startsWith(self, prefix: str) -> bool:
        """
        Returns if there is any word in the trie that starts with the given prefix.
        """
        cur_node = self.root
        for char in prefix:
            if not char in cur_node:
                return False
            cur_node = cur_node[char]
        
        return True

# Your Trie object will be instantiated and called as such:
# obj = Trie()
# obj.insert(word)
# param_2 = obj.search(word)
# param_3 = obj.startsWith(prefix)

哈夫曼树、哈夫曼编码

发表于 2020-07-28 | 更新于 2020-07-28 | 分类于数据结构

哈夫曼树及哈夫曼编码详解【完整版代码】

概率

发表于 2020-07-25 | 更新于 2020-07-25 | 分类于数学

1. 几何概率

1.1 一根木棒，截成三截，组成三角形的概率是多少。

答案：0.25

画图求概率

$x+y>1-x-y$ => $2x+2y > 1$

$x+1-x-y>y$ => $y < 1/2$

$x-y<1-x-y$ => $x < 1/2$

滑动窗口

发表于 2020-07-23 | 更新于 2020-07-24 | 分类于手撕代码

1.前言

滑动窗口用来解决一些查找满足一定条件的连续区间的性质（长度等）的问题。由于区间连续，因此当区间发生变化时，可以通过旧有的计算结果对搜索空间进行剪枝，这样便减少了重复计算，降低了时间复杂度。

2. 两种形式

非固定长度窗口—快慢指针；如满足某一条件的子数组的最大最小长度。
固定长度窗口—双端队列；如满足固定长度、某一条件的子数组中的最大最小值。

3. 非固定长度窗口

3.1 解题框架

left = 0
right = 0

window = []
while right < len(s):
	window.append(s[right])
  right += 1
  
  while (windon needs shrink):
    window.pop(0)
    left += 1

3.2 长度最小的子数组

leetcode:209 长度最小的子数组

给定一个含有 n 个正整数的数组和一个正整数 s ，找出该数组中满足其和 ≥ s 的长度最小的连续子数组，并返回其长度。如果不存在符合条件的子数组，返回 0。

1
2
3

输入：s = 7, nums = [2,3,1,2,4,3]
输出：2
解释：子数组 [4,3] 是该条件下的长度最小的子数组。

class Solution:
    def minSubArrayLen(self, s: int, nums: List[int]) -> int:
        if not nums: return 0

        left = 0
        right = 0
        
        summation = 0
        res = len(nums) + 1
       
        while right < len(nums):
            summation += nums[right]
            right += 1

            while summation >= s:
                res = min(res, right-left)
                summation -= nums[left]
                left += 1
 
        return res if res != len(nums)+1 else 0

3.2 无重复字符的最长子串

Leetcode:3 无重复字符的最长子串

给定一个字符串，请你找出其中不含有重复字符的 最长子串 的长度。

1
2
3

输入: "abcabcbb"
输出: 3 
解释: 因为无重复字符的最长子串是 "abc"，所以其长度为 3。

1
2
3

输入: "bbbbb"
输出: 1
解释: 因为无重复字符的最长子串是 "b"，所以其长度为 1。

输入: "pwwkew"
输出: 3
解释: 因为无重复字符的最长子串是 "wke"，所以其长度为 3。
     请注意，你的答案必须是 子串 的长度，"pwke" 是一个子序列，不是子串。

class Solution:
    def lengthOfLongestSubstring(self, s: str) -> int:
        left = 0
        right = 0
        res = 0

        char2count = {}
        while right < len(s):
            char = s[right]
            right += 1
            if char not in char2count:
                char2count[char] = 0
            char2count[char] += 1

            while char2count[char] > 1:
                new_char = s[left]
                char2count[new_char] -= 1
                left += 1
            res = max(res, right-left)
        
        return res

3.3 最大连续1的个数 III

Leetcode 1004. 最大连续1的个数 III

给定一个由若干 0 和 1 组成的数组 A，我们最多可以将 K 个值从 0 变成 1 。

返回仅包含 1 的最长（连续）子数组的长度。

输入：A = [1,1,1,0,0,0,1,1,1,1,0], K = 2
输出：6
解释： 
[1,1,1,0,0,**1**,1,1,1,1,**1**]
粗体数字从 0 翻转到 1，最长的子数组长度为 6。

class Solution:
    def longestOnes(self, A: List[int], K: int) -> int:
        left = 0
        right = 0

        res = 0
        while right < len(A):
            if A[right] == 0:
                K -= 1
            right += 1

            while K < 0:
                if A[left] == 0:
                    K += 1
                left += 1
            
            res = max(res, right-left)
        
        return res

3.4 最小覆盖子串

Leetcode: 76. 最小覆盖子串

给你一个字符串 S、一个字符串 T，请在字符串 S 里面找出：包含 T 所有字符的最小子串。

1 2	输入: S = "ADOBECODEBANC", T = "ABC" 输出: "BANC"

说明：

如果 S 中不存这样的子串，则返回空字符串 ""。
如果 S 中存在这样的子串，我们保证它是唯一的答案。

class Solution:
    def minWindow(self, s: str, t: str) -> str:
        from collections import defaultdict

        def helper(needs, windows):
            for key, value in needs.items():
                if windows[key] < value:
                    return False
            return True
        
        needs = defaultdict(int)
        windows = defaultdict(int)
        for char in t:
            needs[char] += 1
            windows[char] = 0
        
        left = 0
        right = 0
        start = 0

        res = len(s) + 1

        while right < len(s):
            char = s[right]
            right += 1

            if char in windows:
                windows[char] += 1
            

            while helper(needs, windows):
                if right-left < res:
                    start = left
                    res = right-left
                
                char = s[left]
                if char in windows:
                    windows[char] -= 1
                left += 1

        return '' if res == len(s) + 1 else s[start:start+res]

class Solution:
    def findAnagrams(self, s: str, p: str) -> List[int]:

        from collections import defaultdict

        def check(needs, windows):
            for key, value in needs.items():
                if windows[key] != value:
                    return False
            return True

        needs = defaultdict(int)
        windows = defaultdict(int)

        for char in p:
            needs[char] += 1
            windows[char] = 0

        left = 0 
        right = 0
        res = []

        while right < len(s):
            char = s[right]
            if char in windows:
                windows[char] += 1
            right += 1

            while right - left == len(p):
                if check(needs, windows):
                    res.append(left)
                char = s[left]
                if char in windows:
                    windows[char] -= 1
                left += 1
        return res

3.5 至多包含两个不同字符的最长子串

给定一个字符串 s ，找出至多包含两个不同字符的最长子串 t 。

1
2
3

输入: "eceba"
输出: 3
解释: t 是 "ece"，长度为3。

1
2
3

输入: "ccaabbb"
输出: 5
解释: t 是 "aabbb"，长度为5。

class Solution:
    def lengthOfLongestSubstring(self, s: str) -> int:
        count = 0
        char2count = {}
        left = 0
        right = 0
        res = 0 
      
        while right < len(s):
            char = s[right]
            right += 1
            if char not in char2count:
                char2count[char] = 0
            char2count[char] += 1
            if char2count[char] == 1:
                count += 1
            
            while count > 2:
                char = s[left]
                char2count[char] -= 1
                if char2count[char] == 0:
                    count -= 1
                left += 1
                
            res = max(res, right-left)
        return res

3.6 字符串的排列

Leetcode: 567. 字符串的排列

给定两个字符串 s1 和 s2，写一个函数来判断 s2 是否包含 s1 的排列。

换句话说，第一个字符串的排列之一是第二个字符串的子串。

1
2
3

输入: s1 = "ab" s2 = "eidbaooo"
输出: True
解释: s2 包含 s1 的排列之一 ("ba").

1 2	输入: s1= "ab" s2 = "eidboaoo" 输出: False

class Solution:
    def checkInclusion(self, s1: str, s2: str) -> bool:
        from collections import defaultdict

        def check(needs, windows):
            for key,value in needs.items():
                if value != windows[key]:
                    return False
            return True

        needs = defaultdict(int)
        windows = defaultdict(int)     
        res = True

        for c in s1:
            needs[c] += 1
            windows[c] = 0

        left = 0
        right = 0

        while right < len(s2):
            char = s2[right]
            right += 1
            windows[char] += 1
            
            while right-left == len(s1):
                if check(needs, windows):
                    return True
                char = s2[left]
                windows[char] -= 1
                left += 1
        return False

3.7 串联所有单词的子串

Leetcode 30. 串联所有单词的子串

给定一个字符串 s 和一些长度相同的单词 words。找出 s 中恰好可以由 words 中所有单词串联形成的子串的起始位置。

注意子串要与 words 中的单词完全匹配，中间不能有其他字符，但不需要考虑 words 中单词串联的顺序。

输入：
  s = "barfoothefoobarman",
  words = ["foo","bar"]
输出：[0,9]
解释：
从索引 0 和 9 开始的子串分别是 "barfoo" 和 "foobar" 。
输出的顺序不重要, [9,0] 也是有效答案。

输入：
  s = "wordgoodgoodgoodbestword",
  words = ["word","good","best","word"]
输出：[]

class Solution:
    def findSubstring(self, s: str, words: List[str]) -> List[int]:
        from collections import Counter

        if not words or not s: return []
        
        one_word = len(words[0])
        word_num = len(words)

        n = len(s)
        words = Counter(words)
        res = []

        for i in range(0, one_word):
            cur_count = 0
            left = i
            right = i
            cur_Counter = Counter()
            while right + one_word <= n:
                w = s[right:right + one_word]
                right += one_word
                if w not in words:
                    left = right
                    cur_Counter.clear()
                    cur_count = 0
                else:
                    cur_Counter[w] += 1
                    cur_count += 1
                    while cur_Counter[w] > words[w]:
                        left_w = s[left:left+one_word]
                        left += one_word
                        cur_Counter[left_w] -= 1
                        cur_count -= 1
                    if cur_count == word_num:
                        res.append(left)
        return res

3.8 最小区间

你有 k 个升序排列的整数数组。找到一个最小区间，使得 k 个列表中的每个列表至少有一个数包含在其中。

我们定义如果 b-a < d-c 或者在 b-a == d-c 时 a < c，则区间 [a,b] 比 [c,d] 小。

输入:[[4,10,15,24,26], [0,9,12,20], [5,18,22,30]]
输出: [20,24]
解释:
列表 1：[4, 10, 15, 24, 26]，24 在区间 [20,24] 中。
列表 2：[0, 9, 12, 20]，20 在区间 [20,24] 中。
列表 3：[5, 18, 22, 30]，22 在区间 [20,24] 中。

class Solution(object):
    def smallestRange(self, nums: List[List[int]]): -> List[int]
        from collections import defaultdict

        v  = list()
        for i in range(len(nums)):
            for num in nums[i]:
                v.append((num, i))
        v.sort()

        left, right, n = 0, 0, len(v)
        index2count = collections.defaultdict(int)
        k = len(nums)
        count = 0
        res = [0, 0]
        diff = float('inf')

        while right < n:
            index = v[right][1]
            if index2count[index] == 0:
                count += 1
            index2count[index] += 1
            right += 1

            while count == k:
                if v[right][0] - v[left][0] < diff:
                    diff = v[right][0] - v[left][0]
                    res = [v[left][0], v[right][0]]
                d[v[left][1]] -= 1

                if d[v[left][1]] == 0:
                    count -= 1
                left += 1
        return res

4. 固定窗口

4.1 滑动窗口最大值

239. 滑动窗口最大值

给定一个数组 nums，有一个大小为 k 的滑动窗口从数组的最左侧移动到数组的最右侧。你只可以看到在滑动窗口内的 k 个数字。滑动窗口每次只向右移动一位。

返回滑动窗口中的最大值。

输入: nums = [1,3,-1,-3,5,3,6,7], 和 k = 3
输出: [3,3,5,5,6,7] 
解释: 

  滑动窗口的位置                最大值
---------------               -----
[1  3  -1] -3  5  3  6  7       3
 1 [3  -1  -3] 5  3  6  7       3
 1  3 [-1  -3  5] 3  6  7       5
 1  3  -1 [-3  5  3] 6  7       5
 1  3  -1  -3 [5  3  6] 7       6
 1  3  -1  -3  5 [3  6  7]      7

1	类似于单调队列的思想

class Solution:
    def maxSlidingWindow(self, nums: List[int], k: int) -> List[int]:
      
        from collections import deque

        queue = deque()  

        # init queue
        for i in range(k):
            while queue and nums[queue[-1]] <= nums[i]: queue.pop()
            queue.append(i)
        
        res = [nums[queue[0]]]
        # slide the window
        for i in range(k, len(nums)):
            # remove the start of the window
            while queue and i - queue[0] + 1 > k:
                queue.popleft()

            while queue and nums[queue[-1]] <= nums[i]: queue.pop()
            queue.append(i)
            
            res.append(nums[queue[0]])
        return res

面试大杂烩

发表于 2020-07-21 | 更新于 2020-07-21

偏差：描述模型输出结果的期望与样本真实结果的差距。
方差：描述模型对于给定值的输出稳定性。

如果模型不能很好地拟合训练数据，则模型具有高偏差（也称欠拟合 - Underfitting）；
反之，如果模型能够很好地拟合训练数据，但在测试数据上有较大的误差，则模型具有高方差（也称过拟合 - Overfitting）。

[机器学习]偏差和方差的理解

True Positive Rate（真阳率）、False Positive（伪阳率）

TPRate的意义是所有真实类别为1的样本中，预测类别为1的比例。
FPRate的意义是所有真实类别为0的样本中，预测类别为1的比例。

如何理解机器学习和统计中的AUC？

朴素贝叶斯

发表于 2020-07-19 | 更新于 2020-07-19 | 分类于机器学习

1. 定义

朴素贝叶斯法（Naive Bayes）：基于贝叶斯定理与特征条件独立的分类方法。

对于给定的训练数据集，基于特征条件独立假设学习输入输出的联合概率分布；
基于联合概率分布，对给定的输入 $x$ ，利用贝叶斯定理求出后验概率最大输出 $ y $ 。

条件概率公式 $P(A|B) = \frac{P(A, B)}{P(B)}$ $P(B|A) = \frac{P(A, B)}{P(A)}$
贝叶斯定理 $P(A|B) = P(B|A) * \frac{P(A)}{P(B)}$

2 基本方法

联合概率分布 $P(X, Y)$

先验概率分布 $P(Y = c_k)， k=1,2,...,K$
条件概率分布 $P(X=x|Y=c_k) = P(X^{(1)}=x^{(1)}, ..., X^{(n)}=x^{(n)} | Y=c_k), k=1,2,...,K$
条件概率分布加入强的条件独立假设 $\begin{align} P(X=x|Y=c_k) &= P(X^{(1)}=x^{(1)}, ..., X^{(n)}=x^{(n)} | Y=c_k) \\ &= \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} | Y=c_k) \\ \end{align}$
朴素贝叶斯法实际上学习到生成数据的机制，属于生成模型。条件独立假设等于是说分类的特征在类确定的条件下都是条件独立的，这一假设使得朴素贝叶斯算法变得简单，但有时会牺牲一定的分类准确率。
预测时，计算后验概率分布，将后验概率最大的类作为 $x$ 的类输出 $\begin{align} P(Y=c_k | X = x) &= \frac{ P (X=x |Y=c_k) * \frac{P(Y=c_k)}{P(X=x)} }{ \sum_k P (X=x | Y=c_k) * \frac{P(Y=c_k)}{P(X=x)} } \\ &= \frac{ P (X=x | Y=c_k) * P(Y=c_k) }{ \sum_k P (X=x | Y=c_k) * P(Y=c_k) } \\ \end{align}$

拉普拉斯平滑

3 代码示例

梯度下降法、牛顿法、拟牛顿法

发表于 2020-07-12 | 更新于 2020-08-05 | 分类于最优化

1 梯度下降法

1.1 目的

求解无约束最优化问题的极小值

1.2 方法

通过不断的迭代，进行目标函数的极小化，直到收敛

1.3 泰勒一阶展开

$f(x) = f(x_0) + (x-x_0) f'(x_0)$

1.4 思想

极小化 $f(x)$ ，则 $x-x_0 = -f’(x_0) * \lambda$ ，其中 $\lambda > 0 $

1.5 结论

$ x = x_0 - \lambda * f’(x_0)$ ，其中学习率 $\lambda > 0 $

1.6 应用

数学问题中，求解极小值
机器学习中，求解目标函数最优解

2 牛顿法

2.1 简介

牛顿法，又称牛顿迭代法，主要是迭代的方法用于求解方程的根。后应用于无约束问题的最优化。

2.2 解方程

利用泰勒公式一阶展开，$f(x) = f(x_0) + (x-x_0) f’(x_0)$

求解方程 $f(x) = 0$，则 $f(x_0) + (x-x_0) f’(x_0) = 0$

推出，$x = x_0 - \frac{f(x_0)}{f’(x_0)}$

算法流程：

不断迭代 $x_{k+1} = x_k - \frac{f(x_k)}{f’(x_k)}$，

直至 $f(x_{k+1}) = 0$

举例如下：

求解$\sqrt2$ ，保留 5位小数
思路：设计 $y = f(x)$，将$x = \sqrt2$ 代入，$y = 0$，故 $f(x) = x^2 -2$

class Solution:
    def mySqrt(self, x: int) -> int:
        num_pre = x
        num_cur = (num_pre * num_pre + x) / (2 * num_pre)

        while abs(num_pre-num_cur) > 0.0001:
            num_pre = num_cur
            num_cur = (num_pre * num_pre + x) / (2 * num_pre)
        
        return num_cur

2.3 求解无约束最优化问题

利用泰勒公式二阶展开

$f(x) = f(x_0) + (x-x_0) f'(x_0) + \frac{1}{2} (x-x_0)^2 f''(x_0)$

函数$f(x)$ 在 $x_0$有极值的必要条件是 $f’(x_0) = 0$

$0 = f'(x) = f'(x_0) + (x-x_0) f''(x_0)$ $x = x_0 - \frac{ f'(x_0)}{ f''(x_0)}$

2.4 最优化问题应用至高维

引入Hessian矩阵

$H(x) = [\frac{\partial^2f}{\partial x_i \partial x_j }]_{n*n}$ $x_{k+1} = x_k - \frac{g_k}{H_k} = x_k - H_k^{-1} g_k$

其中，$g_k$ 是一阶导（梯度），$H_k$ 是Hessian矩阵

3 拟牛顿法

避免每次都要计算Hessian矩阵，采用正定矩阵近似方法实现

4 结论

4.1 速记

梯度下降时，迭代 $ x = x_0 - \lambda * f’(x_0)$ 直至收敛，其中学习率 $\lambda > 0 $
牛顿法解方程时，迭代 $ x = x_0 - \frac{f(x_0)}{f’(x_0)} $，直至收敛
牛顿法最优化时，迭代 $ x = x_0 - \frac{ f’(x_0)}{ f’’(x_0)}$ 直至收敛

4.2 对比

梯度下降法靠近极小值，收敛速度慢；通过步长控制，极值点震荡
牛顿法是二阶收敛，梯度下降法是一阶收敛，牛顿法收敛速度快
牛顿法容易陷入局部最优
牛顿法要同时求出梯度和Hessian矩阵，计算量大且不好计算；当输入向量维度$N$较大时，Hessian矩阵的大小时 $N*N$，需要内存和显存非常大

最长上升子序列

发表于 2020-07-11 | 更新于 2020-07-22 | 分类于手撕代码

1. 最长上升子序列

Leetcode 300: 最长上升子序列

给定一个无序的整数数组，找到其中最长上升子序列的长度。

1
2
3

输入: [10,9,2,5,3,7,101,18]
输出: 4 
解释: 最长的上升子序列是 [2,3,7,101]，它的长度是 4。

1.1 动态规划解法

dp[i] : 以i为结尾的最长上升子序列

class Solution:
    def lengthOfLIS(self, nums: List[int]) -> int:

        if not nums: return 0

        dp = [1] * len(nums)
        res = 1 
        for i in range(1, len(nums)):
            for j in range(i):
                if nums[i] > nums[j]:
                    dp[i] = max(dp[i], dp[j]+1)
                res = max(res, dp[i])
        return res

1.2 二分查找

基于贪心的思想，并结合二分查找解题。

新建数组 res，用于保存最长上升子序列。对原序列进行遍历，将每位元素二分插入 cell 中。

如果 res 中元素都比它小，将它插到最后
否则，用它覆盖掉比它大的元素中最小元素

总之，思想就是让 res 中存储比较小的元素。这样，res 未必是真实的最长上升子序列，但长度是对的。

class Solution:
    def lengthOfLIS(self, nums: List[int]) -> int:
        from bisect import bisect_left

        res = []
        for num in nums:
            index = bisect_left(res, num)

            if index == len(res):
                res.append(num)
            else:
                res[index] = num
     
        return len(res)

2. 俄罗斯套娃娃信封问题

Leetcode 354 俄罗斯套娃娃信封问题

给定一些标记了宽度和高度的信封，宽度和高度以整数对形式 (w, h) 出现。当另一个信封的宽度和高度都比这个信封大的时候，这个信封就可以放进另一个信封里，如同俄罗斯套娃一样。

请计算最多能有多少个信封能组成一组“俄罗斯套娃”信封（即可以把一个信封放到另一个信封里面）。

说明:
不允许旋转信封。

示例:

1
2
3

输入: envelopes = [[5,4],[6,4],[6,7],[2,3]]
输出: 3 
解释: 最多信封的个数为 3, 组合为: [2,3] => [5,4] => [6,7]。

思路：

按照宽度升序、高度降序给信封排序
此时宽度已经是升序状态，可以将问题理解为乱序的高度的最长上升子序列问题。

class Solution:
    def maxEnvelopes(self, envelopes: List[List[int]]) -> int:
        envelopes.sort(key=lambda x: (x[0], -x[1]))

        from bisect import bisect_left

        res = []
        for _, num in envelopes:
            index = bisect_left(res, num)

            if index == len(res):
                res.append(num)
            else:
                res[index] = num

        return len(res)