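NumPy

Introduction

According to Drew Conway, data science combines:

Hacking skills
- e.g. programming ability
- vectorized operations and algorithmic thinking

Math and statistics knowledge
- e.g. common distributions and least squares

Substantive expertise

The activities involved in data science, according to David Donoho:

Data exploration and preparation
- data manipulation, cleaning, etc.

Data representation and transformation
- working with data in different formats: tabular, images, text, etc.

Computing with data
- analyzing data programmatically (Python or R)

Data modeling
- machine learning models for prediction, clustering, etc.

Data visualization and presentation
- plots, interactive graphics, animations, etc.

Data science itself and the disciplinary knowledge it involves

What is data analysis

The process of analyzing large amounts of collected data with appropriate statistical methods, extracting useful information and drawing conclusions, and studying and summarizing the data in detail.

The purpose of data analysis

To discover patterns in data, verify hypotheses, and make predictions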

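Preparation

Install Anaconda

Link: Anaconda

Anaconda installation tutorial

ps: Before installing Anaconda, uninstall every existing Python installation (otherwise you may get errors or be unable to run Python 3.x). During installation, tick both checkboxes (make sure the option that adds Python to the system environment variables is ticked).

Install PyCharm

Link: PyCharm

ps: To configure the environment, go to Project Interpreter and select the path of the python executable inside Anaconda.

Link: A reference of common Anaconda commands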

Advanced Python tips

Conditional expressions

List comprehensions

Dict comprehensions

# Conditional expression
import math
def get_log(x):
    '''
    Compute the natural logarithm of x; return nan when x <= 0
    '''
    log_v = math.log(x) if x > 0 else float('nan')

    return log_v
print(get_log(5))
print(get_log(-1))
print('---------separator-------')

# List comprehension
l1 = [i for i in range(1, 100) if i % 2 == 0]
print(l1)
print('---------separator-------')

# Dict comprehension
D = {x.upper(): x * 3 for x in 'abcd'}
print(D)

Output:

1.6094379124341003
nan
---------separator-------
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]
---------separator-------
{'A': 'aaa', 'B': 'bbb', 'C': 'ccc', 'D': 'ddd'}

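Common Python container types (list, set, dictionary, tuple)

ps: For the fundamentals of lists, sets, dictionaries, and tuples, see the Python basics notes; this section covers functions and methods that were not included there.

list

l = [1, 'a', 2, 'b']
print(type(l))
print('Before modification:', l)

# Modify an element of the list
l[0] = 3
print('After modification:', l)

# Append an element at the end
l.append(4)
print('After append:', l)

# Iterate over the list
print('Iterate over the list (for loop):')
for item in l:
    print(item)

# Iterate over the list by index
print('Iterate over the list (while loop):')
i = 0
while i < len(l):
    print(l[i])
    i += 1

# Concatenate lists
print('List concatenation (+):', [1, 2] + [3, 4])

# Repeat a list
print('List repetition (*):', [1, 2] * 5)

# Test whether an element is in the list
print('Membership test (in):', 1 in [1, 2])

Output:

<class 'list'>
Before modification: [1, 'a', 2, 'b']
After modification: [3, 'a', 2, 'b']
After append: [3, 'a', 2, 'b', 4]
Iterate over the list (for loop):
3
a
2
b
4
Iterate over the list (while loop):
3
a
2
b
4
List concatenation (+): [1, 2, 3, 4]
List repetition (*): [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
Membership test (in): True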

tuple

t = (1, 'a', 2, 'b')
print(type(t))

# A tuple's contents cannot be modified; the line below would raise TypeError
# t[0] = 3

# Iterate over the tuple
print('Iterate over the tuple (for loop):')
for item in t:
    print(item)

# Iterate over the tuple by index
print('Iterate over the tuple (while loop):')
i = 0
while i < len(t):
    print(t[i])
    i += 1

# Unpacking: assign the tuple's elements to variables one by one
a, b, c, _ = t
print('unpack: ', a)

# The number of variables must match the length of the tuple, otherwise ValueError is raised
# This often comes up when assigning a function's return values
# a, b, c = t
import struct

# pack - unpack with the struct module
print('===== pack - unpack =====')

packed = struct.pack("ii", 20, 400)    # pack two ints into a bytes object
print('packed:', packed)
print('len(packed):', len(packed))  # len(packed): 8

a1, a2 = struct.unpack("ii", packed)   # recover the two ints from the bytes
print("a1:", a1)  # a1: 20
print("a2:", a2)  # a2: 400

print('struct.calcsize:', struct.calcsize("ii"))  # struct.calcsize: 8

Output:

<class 'tuple'>
Iterate over the tuple (for loop):
1
a
2
b
Iterate over the tuple (while loop):
1
a
2
b
unpack:  1
===== pack - unpack =====
packed: b'\x14\x00\x00\x00\x90\x01\x00\x00'
len(packed): 8
a1: 20
a2: 400
struct.calcsize: 8
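
The two commented-out lines in the tuple example (assigning to an element, and unpacking into the wrong number of variables) can be tried safely inside try/except. The following is a small sketch; the names `errors`, `first`, and `rest` are made up for illustration, and the starred form is shown only as one possible way to unpack when the length is not known in advance.

```python
t = (1, 'a', 2, 'b')
errors = []

# Assigning to a tuple element raises TypeError
try:
    t[0] = 3
except TypeError as exc:
    errors.append(type(exc).__name__)
    print('TypeError:', exc)

# Unpacking four elements into three names raises ValueError
try:
    a, b, c = t
except ValueError as exc:
    errors.append(type(exc).__name__)
    print('ValueError:', exc)

# A starred name collects the remaining elements into a list
first, *rest = t
print(first, rest)
```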

dictionary

d = {'Baidu': 'https://www.baidu.com/',
    'Alibaba': 'https://www.alibaba.com/',
    'Tencent': 'https://www.tencent.com/'}

print('Get a value by key: ', d['Baidu'])

# Iterate over keys
print('Iterate over keys: ')
for key in d.keys():
    print(key)

# Iterate over values
print('Iterate over values: ')
for value in d.values():
    print(value)

# Iterate over items
print('Iterate over items: ')
for key, value in d.items():
    print(key + ': ' + value)

# Formatted output with str.format
print('Formatted output:')
for key, value in d.items():
    print('The URL of {} is {}'.format(key, value))

Output:

Get a value by key:  https://www.baidu.com/
Iterate over keys: 
Baidu
Alibaba
Tencent
Iterate over values: 
https://www.baidu.com/
https://www.alibaba.com/
https://www.tencent.com/
Iterate over items: 
Baidu: https://www.baidu.com/
Alibaba: https://www.alibaba.com/
Tencent: https://www.tencent.com/
Formatted output:
The URL of Baidu is https://www.baidu.com/
The URL of Alibaba is https://www.alibaba.com/
The URL of Tencent is https://www.tencent.com/

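set

# Create a set
print('Create a set:')
my_set = {1, 2, 3}
print(my_set)
my_set = set([1, 2, 3, 2])   # duplicates are dropped
print(my_set)

# Add a single element (a set is unordered, so there is no "end" position;
# adding an element that already exists has no effect)
print('Add a single element:')
my_set.add(3)
print('After adding 3', my_set)

my_set.add(4)
print('After adding 4', my_set)

# Add several elements
print('Add several elements:')
my_set.update([4, 5, 6])
print(my_set)

Output:

Create a set:
{1, 2, 3}
{1, 2, 3}
Add a single element:
After adding 3 {1, 2, 3}
After adding 4 {1, 2, 3, 4}
Add several elements:
{1, 2, 3, 4, 5, 6}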

Counter (similar to a multiset in mathematics)

Link: Counter

Initialization (when printed, elements are listed from most to least common)

import collections

c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
c2 = collections.Counter({'a':2, 'b':3, 'c':1})
c3 = collections.Counter(a=2, b=3, c=1)

print(c1)
print(c2)
print(c3)

Output:

Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})

update() updates the counts; note that it adds to them rather than replacing them

# Note: counts are added, not replaced
c1.update({'a': 4, 'c': -2, 'd': 4})
print(c1)

Output:

Counter({'a': 6, 'd': 4, 'b': 3, 'c': -1})

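Accessing counts with [key]
- Note the difference from dict: if the key is not in the Counter, 0 is returned, whereas a dict raises KeyError

print('a=', c1['a'])
print('b=', c1['b'])
# Unlike a dict, a missing key returns 0
print('e=', c1['e'])

Output:

a= 6
b= 3
e= 0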
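
To make the contrast with dict concrete, here is a minimal sketch that looks up the same missing key in a Counter and in a plain dict. The variable names (`counts`, `plain`, and so on) are illustrative only and do not come from the original example.

```python
import collections

counts = collections.Counter('hello')
plain = dict(counts)

# Counter: a missing key yields 0 and is not inserted
missing_from_counter = counts['z']
print(missing_from_counter, 'z' in counts)

# dict: a missing key raises KeyError
try:
    plain['z']
    dict_raised = False
except KeyError:
    dict_raised = True
print(dict_raised)

# dict.get with a default avoids the exception
missing_with_get = plain.get('z', 0)
print(missing_with_get)
```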

elements() method (yields each element as many times as its count; elements with a count below 1 are skipped)

for element in c1.elements():
    print(element)

Output:

a
a
a
a
a
a
b
b
b
d
d
d
d

most_common() method: returns the n most common elements (for example, only the top three in a word-frequency count)

print(c1.most_common(3))

Output:

[('a', 6), ('d', 4), ('b', 3)]

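defaultdict

In Python, accessing a key that does not exist in a dictionary raises KeyError. Sometimes it is convenient for every key to have a default value.
defaultdict is a subclass of the built-in dict class. Its first argument provides the initial value of the default_factory attribute, which defaults to None. It overrides one method and adds one writable instance variable. Otherwise it behaves like dict, except that it supplies a default value for a missing key and so avoids KeyError.

Link: defaultdict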

# Count how many times each letter occurs
s = 'chinadoop'

# Using Counter
print(collections.Counter(s))

Output:

Counter({'o': 2, 'c': 1, 'h': 1, 'i': 1, 'n': 1, 'a': 1, 'd': 1, 'p': 1})

Using dict:

# Using dict
counter = {}
for c in s:
    if c not in counter:  # first occurrence: initialize to 1
        counter[c] = 1
    else:
        counter[c] += 1

print(counter.items())

Output:

dict_items([('c', 1), ('h', 1), ('i', 1), ('n', 1), ('a', 1), ('d', 1), ('o', 2), ('p', 1)])

Using defaultdict:

# Using defaultdict
counter2 = collections.defaultdict(int)  # int() returns 0, so missing keys start at 0
for c in s:
    counter2[c] += 1
print(counter2.items())

Output:

dict_items([('c', 1), ('h', 1), ('i', 1), ('n', 1), ('a', 1), ('d', 1), ('o', 2), ('p', 1)])

Grouping values that share a key into lists

# Group values that share a key into lists
colors = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = collections.defaultdict(list)  # list means each missing key starts with an empty list
for k, v in colors:
    d[k].append(v)

print(d.items())

Output:

dict_items([('yellow', [1, 3]), ('blue', [2, 4]), ('red', [1])])

map() function

map(function, sequence)
Useful for data cleaning

import math

print('Example 1: element-wise minimum of two lists:')  # map applies a function to each element
l1 = [1, 3, 5, 7, 9]
l2 = [2, 4, 6, 8, 10]
mins = map(min, l1, l2)   # map only sets up the computation and returns a map object; nothing is evaluated yet
print(mins)
print('-----separator--------')
# map() does the work only when the data is accessed
for item in mins:    # iterating over the map object triggers the computation
    print(item)
print('-----separator--------')
print('Example 2: square root of each element of a list:')
squared = map(math.sqrt, l2)
print(squared)
print(list(squared))

Output:

Example 1: element-wise minimum of two lists:
<map object at 0x000002B8D0A2D400>
-----separator--------
1
3
5
7
9
-----separator--------
Example 2: square root of each element of a list:
<map object at 0x000002B8D0A2D080>
[1.4142135623730951, 2.0, 2.449489742783178, 2.8284271247461903, 3.1622776601683795]
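
One consequence of this lazy evaluation is that a map object is a one-shot iterator: once consumed, it yields nothing more. The sketch below illustrates this with the same two lists; `first_pass` and `second_pass` are illustrative names.

```python
l1 = [1, 3, 5, 7, 9]
l2 = [2, 4, 6, 8, 10]

mins = map(min, l1, l2)

# The first pass runs the computation and consumes the iterator
first_pass = list(mins)
# The second pass finds the iterator already exhausted
second_pass = list(mins)

print(first_pass)   # [1, 3, 5, 7, 9]
print(second_pass)  # []
```

If the results are needed more than once, converting to a list right away, as in `list(map(...))`, is one simple option.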

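Anonymous functions: lambda

For simple, single-expression functions
A lambda expression evaluates to a function object
Can be combined with map() for data cleaning

my_func = lambda a, b, c: a * b  # the lambda expression evaluates to a function object (c is unused)
print(my_func)
print(my_func(1, 2, 3))
print('------separator----------')
# Combined with map
print('lambda combined with map:')
l1 = [1, 3, 5, 7, 9]
l2 = [2, 4, 6, 8, 10]
result = map(lambda x, y: x * 2 + y, l1, l2)  # avoids an explicit for loop (x and y are the parameters; the expression after the colon is the body)
print(list(result))

Output:

<function <lambda> at 0x000002B8CF089F28>
2
------separator----------
lambda combined with map:
[4, 10, 16, 22, 28]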

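The scientific computing library NumPy (Numerical Python)

The foundational package for high-performance scientific computing and data analysis; it provides a multidimensional array object
ndarray: a multidimensional array (matrix) with vectorized operations; fast and memory-efficient
Matrix operations without explicit loops, similar to the vectorized operations in Matlab
Linear algebra and random number generation
import numpy as np
Link: numpy_api

Note: SciPy is also a scientific computing library

It builds on NumPy and adds many library functions commonly used in mathematics, science, and engineering
Linear algebra, solving ordinary differential equations, signal processing, image processing, sparse matrices, etc.
import scipy as sp

ndarray, the N-dimensional array object (matrix)

ndim attribute: the number of dimensions
shape attribute: the size of each dimension
dtype attribute: the data type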

import numpy as np
m = np.array([[2,3,4],[2,3,4],[3,4,5],[6,7,8]])
print(m)
print(m.shape)  # 4 rows and 3 columns: axis 0 has length 4, axis 1 has length 3
print(m.ndim)    # 2 dimensions
print(m.dtype)   # an integer type (int32 or int64, depending on the platform)

Output:

[[2 3 4]
 [2 3 4]
 [3 4 5]
 [6 7 8]]
(4, 3)
2
int32

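Creating an ndarray

np.array(collection), where collection is a sequence (list) or a nested sequence (list of lists)
np.zeros, np.ones, np.empty create arrays of a given shape filled with 0s, with 1s, or left uninitialized
- Note: the first argument is a tuple that specifies the shape, e.g. (3, 4)
- empty does not always return zeros; it may return arbitrary uninitialized values

import numpy as np
# Create an array

my_list = [1, 2, 3]
x = np.array(my_list)

print('List:', my_list)
print('Array: ', x)
print(x.shape)   # size of each dimension; here the single dimension has length 3
print(x.ndim)    # 1 dimension
print(x.dtype)   # an integer type

Output:

List: [1, 2, 3]
Array:  [1 2 3]
(3,)
1
int32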

Array arithmetic

# Array arithmetic
np.array([1, 2, 3]) - np.array([4, 5, 6])

Output:

array([-3, -3, -3])

Creating an ndarray with arange; the third argument is the step

# Create an ndarray with arange; the third argument is the step
n = np.arange(0, 30, 2)
print(n)
print('-----------separator----------')
# reshape is used often and is worth knowing well
n = n.reshape(3, 5)
print('After reshape: ')
print(n)

Output:

[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28]
-----------separator----------
After reshape: 
[[ 0  2  4  6  8]
 [10 12 14 16 18]
 [20 22 24 26 28]]

Initializing identity matrices and similar arrays

print('ones:\n', np.ones((3, 2)))
print('zeros:\n', np.zeros((3, 2)))
print('eye:\n', np.eye(3))
print('diag:\n', np.diag(my_list))
print('-----separator---------')
a = np.zeros((3,4),dtype = int)   # the first argument is a tuple giving the shape; dtype sets the data type; use empty with care
print(a)

Output:

ones:
 [[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]
zeros:
 [[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]
eye:
 [[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
diag:
 [[1 0 0]
 [0 2 0]
 [0 0 3]]
-----separator---------
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
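
As a sketch of why np.empty needs care: the array it returns holds whatever happened to be in memory, so it should be written before it is read. The example below fills it explicitly and also shows np.full as an alternative when a known starting value is wanted; the names `buf` and `filled` are illustrative.

```python
import numpy as np

# np.empty allocates without initializing; the contents are arbitrary
buf = np.empty((3, 4))

# Write every element before reading anything from the array
buf[:] = 0
print(buf)

# np.full creates an array already filled with a given value
filled = np.full((3, 4), 7)
print(filled)
```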

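The difference between repeat and *

# The difference between repeat and *
print('* on a list:\n', np.array([1, 2, 3] * 3))   # * repeats the Python list as a whole before it becomes an array
print('repeat:\n', np.repeat([1, 2, 3], 3))        # repeat repeats each element in turn

Output:

* on a list:
 [1 2 3 1 2 3 1 2 3]
repeat:
 [1 1 1 2 2 2 3 3 3]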
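
Note that the `*` in this example acts on a Python list, not on an array. A short sketch of what happens when the operand is already an ndarray, with np.tile as the array-level way to repeat the whole sequence (variable names are illustrative):

```python
import numpy as np

base = np.array([1, 2, 3])

# On an ndarray, * is element-wise multiplication, not repetition
scaled = base * 3
print(scaled)      # [3 6 9]

# np.tile repeats the whole array; np.repeat repeats each element
tiled = np.tile(base, 3)
repeated = np.repeat(base, 3)
print(tiled)       # [1 2 3 1 2 3 1 2 3]
print(repeated)    # [1 1 1 2 2 2 3 3 3]
```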

Stacking matrices

p1 = np.ones((3, 3))
print(p1)
print('-------separator----------')
p2 = np.arange(9).reshape(3, 3)
print(p2)
print('Vertical stack: \n', np.vstack((p1, p2)))
print('Horizontal stack: \n', np.hstack((p1, p2)))

Output:

[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
-------separator----------
[[0 1 2]
 [3 4 5]
 [6 7 8]]
Vertical stack: 
 [[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]
Horizontal stack: 
 [[ 1.  1.  1.  0.  1.  2.]
 [ 1.  1.  1.  3.  4.  5.]
 [ 1.  1.  1.  6.  7.  8.]]

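Array operations

print('p1: \n', p1)   # p1 and p2 are the arrays defined above
print('p2: \n', p2)

print('p1 + p2 = \n', p1 + p2)   # element-wise arithmetic on two arrays
print('p1 * p2 = \n', p1 * p2)   # element-wise product
print('p2^2 = \n', p2 ** 2)      # element-wise square
print('p1.p2 = \n', p1.dot(p2))  # matrix product

Output: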

p1: 
 [[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
p2: 
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
p1 + p2 = 
 [[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]]
p1 * p2 = 
 [[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]
p2^2 = 
 [[ 0  1  4]
 [ 9 16 25]
 [36 49 64]]
p1.p2 = 
 [[  9.  12.  15.]
 [  9.  12.  15.]
 [  9.  12.  15.]]
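
The outputs above show that `*` and `dot` give different results: the first multiplies element by element, the second is the matrix product. A brief sketch that recreates p1 and p2 and uses the `@` operator, which for 2-D arrays is equivalent to `dot` (`elementwise` and `matrix_product` are illustrative names):

```python
import numpy as np

p1 = np.ones((3, 3))
p2 = np.arange(9).reshape(3, 3)

# Element-wise product: each entry of p1 times the matching entry of p2
elementwise = p1 * p2

# Matrix product: rows of p1 combined with columns of p2
matrix_product = p1 @ p2

print(elementwise)
print(matrix_product)
```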

Transposing an array (.T)

# Transpose an array
p3 = np.arange(6).reshape(2, 3)
print('Shape of p3: ', p3.shape)
print(p3)
p4 = p3.T
print('Shape of p3 after transposing: ', p4.shape)
print(p4)

Output:

Shape of p3:  (2, 3)
[[0 1 2]
 [3 4 5]]
Shape of p3 after transposing:  (3, 2)
[[0 3]
 [1 4]
 [2 5]]

Inspecting and converting the data type (astype)

# Inspect the data type and convert it with astype
print('dtype of p3:', p3.dtype)
print(p3)

p5 = p3.astype('float')
print('dtype of p5:', p5.dtype)
print(p5)

Output:

dtype of p3: int32
[[0 1 2]
 [3 4 5]]
dtype of p5: float64
[[ 0.  1.  2.]
 [ 3.  4.  5.]]

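Basic statistics on data: sum, min, max, mean, standard deviation, etc.

# Basic statistics on data: sum, min, max, mean, standard deviation, etc.
a = np.array([-5,-3,-2 , -6, 3, 5])
print('sum: ', a.sum())
print('min: ', a.min())
print('max: ', a.max())
print('mean: ', a.mean())
print('std: ', a.std())
print('argmax: ', a.argmax())   # index of the maximum value
print('argmin: ', a.argmin())   # index of the minimum value

# Tip: for multidimensional arrays, specify the axis to reduce over; by default the statistic is computed over all dimensions

Output: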

sum:  -8
min:  -6
max:  5
mean:  -1.33333333333
std:  4.0276819912
argmax:  5
argmin:  3
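
The tip about multidimensional arrays can be illustrated with the axis argument. The following sketch uses a small 2-by-3 array chosen for illustration; the variable names are not from the original.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]

# No axis: the statistic is computed over every element
total = m.sum()

# axis=0 reduces down the rows, giving one value per column
col_sums = m.sum(axis=0)

# axis=1 reduces across the columns, giving one value per row
row_sums = m.sum(axis=1)
row_argmax = m.argmax(axis=1)

print(total)       # 15
print(col_sums)    # [3 5 7]
print(row_sums)    # [ 3 12]
print(row_argmax)  # [2 2]
```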

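Indexing and slicing

Indexing a one-dimensional array works much like indexing a Python list
Indexing multidimensional arrays
- arr[r1:r2, c1:c2]
- arr[1,1] is equivalent to arr[1][1]
- [:] selects all the data along a dimension

# One-dimensional array
s = np.arange(13) ** 2
print('s: ', s)
print('s[0]: ', s[0])
print('s[4]: ', s[4])
print('s[0:3]: ', s[0:3])
print('s[[0, 2, 4]]: ', s[[0, 2, 4]])   # index with a list of positions
print('s[0:9:4]:',s[0:9:4])  # the third number is the step

Output: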

s:  [  0   1   4   9  16  25  36  49  64  81 100 121 144]
s[0]:  0
s[4]:  16
s[0:3]:  [0 1 4]
s[[0, 2, 4]]:  [ 0  4 16]
s[0:9:4]: [ 0 16 64]

Two-dimensional array

# Two-dimensional array
r = np.arange(36).reshape((6, 6))
print('r: \n', r)
print('r[2, 2]: \n', r[2, 2])
print('r[3, 3:6]: \n', r[3, 3:6])  # row index 3 (the fourth row), column indices 3, 4, 5

Output:

r: 
 [[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]
r[2, 2]: 
 14
r[3, 3:6]: 
 [21 22 23]
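
The remaining forms from the list above, arr[r1:r2, c1:c2], arr[1,1] versus arr[1][1], and [:], can be sketched on the same 6-by-6 array. The names `block` and `first_col` are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

# arr[r1:r2, c1:c2]: rows 1-2 and columns 2-4
block = r[1:3, 2:5]
print(block)

# arr[1, 1] and arr[1][1] select the same element
print(r[1, 1], r[1][1])

# [:] takes everything along that dimension: here, all rows of column 0
first_col = r[:, 0]
print(first_col)
```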

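Conditional indexing
Boolean arrays: in arr[condition], condition can combine several conditions
Note: combine conditions with & and |, not with and / or

A condition first turns the data into a Boolean array of True and False values; indexing with that array then picks out the values where it is True.

r > 30   # uses the array r from above; the result has dtype bool

Output: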

array([[False, False, False, False, False, False],
       [False, False, False, False, False, False],
       [False, False, False, False, False, False],
       [False, False, False, False, False, False],
       [False, False, False, False, False, False],
       [False,  True,  True,  True,  True,  True]], dtype=bool)

Filtering data

# Filter
print(r[r > 30])  # select the values greater than 30

# Set every value greater than 30 to 30
r[r > 30] = 30
print(r)

Output:

[31 32 33 34 35]
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 30 30 30 30 30]]
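
A minimal sketch of combining several conditions with & and |, using a fresh copy of the 6-by-6 array. Each condition needs its own parentheses because & and | bind more tightly than comparisons; the last part shows that Python's `and` raises ValueError on arrays. Variable names are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

# Even values strictly between 10 and 20
selected = r[(r > 10) & (r < 20) & (r % 2 == 0)]
print(selected)   # [12 14 16 18]

# Values at either extreme
either = r[(r < 2) | (r > 33)]
print(either)     # [ 0  1 34 35]

# `and` asks for a single truth value, which is ambiguous for an array
try:
    (r > 10) and (r < 20)
    and_raised = False
except ValueError:
    and_raised = True
print(and_raised)
```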

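Copying arrays

arr1 = arr2
After this assignment both names refer to the same array, so changes made through arr1 also affect arr2
Use arr1 = arr2.copy() instead so that the two do not affect each other

# Take a slice of the array r from above; a slice is a view, not a copy
r2 = r[:3, :3]
print(r2)

Output:

[[ 0  1  2]
 [ 6  7  8]
 [12 13 14]]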

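Assign to r2 and check whether the original array is affected

# Set the contents of r2 to 0
r2[:] = 0

# Inspect r
print(r)         # the change made through r2 shows up in r as well

Output: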

[[ 0  0  0  3  4  5]
 [ 0  0  0  9 10 11]
 [ 0  0  0 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 30 30 30 30 30]]

Assign to r3 after a copy and check whether the original array is affected

r3 = r.copy()  # copy r into r3
r3[:] = 0
print(r)
print('-------separator----------')
print(r3)   # the change to r3 has no effect on the original r

Output:

[[ 0  0  0  3  4  5]
 [ 0  0  0  9 10 11]
 [ 0  0  0 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 30 30 30 30 30]]
-------separator----------
[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]
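
Whether two arrays share data can also be checked directly rather than inferred from printed output. The sketch below uses np.shares_memory on a plain assignment, a slice, and an explicit copy; `alias`, `view`, and `copied` are illustrative names.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

alias = r                   # plain assignment: the same object
view = r[:3, :3]            # slice: a view onto r's data
copied = r[:3, :3].copy()   # explicit copy: independent data

print(alias is r)                    # True
print(np.shares_memory(r, view))     # True
print(np.shares_memory(r, copied))   # False

# Writing through the view changes r; the copy is untouched
view[:] = -1
print(r[0, 0], copied[0, 0])         # -1 0
```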

Iterating over an array

t = np.random.randint(0, 10, (4, 3))  # a 4-by-3 array of random integers from 0 to 9 (10 is excluded)
print(t)

Output:

[[5 1 9]
 [2 5 1]
 [3 7 9]
 [4 8 1]]

Iterating over rows

for row in t:  # iterate over each row
    print(row)

Output:

[5 1 9]
[2 5 1]
[3 7 9]
[4 8 1]

Using enumerate()

# Using enumerate()
for i, row in enumerate(t):
    print('row {} is {}'.format(i, row))

Output:

row 0 is [5 1 9]
row 1 is [2 5 1]
row 2 is [3 7 9]
row 3 is [4 8 1]

Squaring all the data

t2 = t ** 2  # square every element
print(t2)

Output:

[[25  1 81]
 [ 4 25  1]
 [ 9 49 81]
 [16 64  1]]

Using zip to iterate over two arrays together

# Use zip to iterate over two arrays together
for i, j in zip(t, t2):     # zip pairs each row of t with the matching row of t2
    print('{} + {} = {}'.format(i, j, i + j))

Output:

[5 1 9] + [25  1 81] = [30  2 90]
[2 5 1] + [ 4 25  1] = [ 6 30  2]
[3 7 9] + [ 9 49 81] = [12 56 90]
[4 8 1] + [16 64  1] = [20 72  2]
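
The row-by-row loop is useful for printing, but the sums themselves do not need a loop. As a sketch, the example below compares the zip loop with a single vectorized addition; it uses np.random.default_rng with a fixed seed, which is a choice made here for reproducibility and differs from the randint call in the original.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded generator (an assumption, not in the original)
t = rng.integers(0, 10, size=(4, 3))     # random integers from 0 to 9
t2 = t ** 2

# Row-by-row sums built with zip, as in the loop above
looped = np.array([row + row_sq for row, row_sq in zip(t, t2)])

# The same result from one vectorized expression
vectorized = t + t2

print(np.array_equal(looped, vectorized))  # True
```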

Extracting unique values

a = list([1,1,1,1,1,2,2,2,2,3,3,3,4])  # extract the unique values
b = np.unique(a)
print(b)

Output:

[1 2 3 4]
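
np.unique can also report how often each value occurs through its return_counts argument, which gives a NumPy counterpart to the Counter examples earlier. A short sketch on the same list (`values` and `counts` are illustrative names):

```python
import numpy as np

a = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4]

# return_counts=True also returns the number of occurrences of each unique value
values, counts = np.unique(a, return_counts=True)

print(values)   # [1 2 3 4]
print(counts)   # [5 4 3 1]
```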
