[02] Bar Plot

Bar Plot

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Setting bar colors with bar and barh

# Equivalent two-step form:
# fig = plt.figure(figsize=(12, 7))
# axes = fig.subplots(1, 2)
fig, axes = plt.subplots(1,2, figsize=(12,7))

x = list('ABCDE')
y = list(range(1,6))

clist = ['tomato', 'g', 'r', 'm', 'b']

axes[0].bar(x,y, color = clist) # a list sets the color of each bar individually
axes[1].barh(x,y, color = 'k')  # a single value colors every bar

plt.show()


Exploring a dataset

In [3]:

data = pd.read_csv('./StudentsPerformance.csv')
data.info()
print(data.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
(1000, 8)

In [4]:

data.head(5)

Out[4]:

gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
0 female group B bachelor's degree standard none 72 72 74
1 female group C some college standard completed 69 90 88
2 female group B master's degree standard none 90 95 93
3 male group A associate's degree free/reduced none 47 57 44
4 male group C some college standard none 76 78 75

Counting race/ethnicity groups by gender

In [5]:

group = data.groupby('gender')['race/ethnicity'].value_counts().sort_index()
# group is a Series, so sort it with sort_index (my attempt with sort_values ran into an error)
group

Out[5]:

gender  race/ethnicity
female  group A            36
        group B           104
        group C           180
        group D           129
        group E            69
male    group A            53
        group B            86
        group C           139
        group D           133
        group E            71
Name: race/ethnicity, dtype: int64
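
The CSV file may not be at hand, so here is a minimal sketch that rebuilds `group` from the counts printed above, assuming the same two-level index. It also shows what selecting the outer level returns. The name `totals` is introduced only for this illustration.

```python
import pandas as pd

# Rebuild the two-level index (gender, race/ethnicity) shown in Out[5].
index = pd.MultiIndex.from_product(
    [['female', 'male'],
     ['group A', 'group B', 'group C', 'group D', 'group E']],
    names=['gender', 'race/ethnicity'],
)

# Counts copied from the output above, in the same order.
group = pd.Series([36, 104, 180, 129, 69, 53, 86, 139, 133, 71],
                  index=index, name='race/ethnicity')

# Selecting the outer level drops it and leaves a Series keyed by group.
male = group['male']
print(male.index.tolist())
print(male['group C'])

# Summing over both genders gives the overall count per group.
totals = group.groupby(level='race/ethnicity').sum()
print(totals.tolist())
```

Because `group['male']` is already a Series whose index holds the group names, both its index and its values can be handed straight to a plotting call.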

In [6]:

fig, axes = plt.subplots(1, 2, figsize=(15,7))
axes[0].bar(group['male'].index, group['male'], color = 'royalblue')
axes[1].bar(group['female'].index, group['female'], color = 'tomato')

# group['male'] is a pandas Series, and passing it directly still works
# I used to pass only an array via group.male.values, but a Series works too

plt.show()

The two plots have different y-axis ranges, which makes them hard to compare.

In [7]:

# Method 1: use the sharey parameter (it applies to all four axes created here)
fig, axes = plt.subplots(1, 4, figsize=(15,7), sharey=True)
axes[0].bar(group['male'].index, group['male'], color = 'royalblue')
axes[1].bar(group['female'].index, group['female'], color = 'tomato')

# Method 2: call set_ylim on every axes in a loop
axes[2].bar(group['male'].index, group['male'], color = 'royalblue')
axes[3].bar(group['female'].index, group['female'], color = 'tomato')

for ax in axes:
    ax.set_ylim(0,200)
plt.show()
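
In the cell above both methods act on the same figure, so their effects overlap. The following sketch separates them into two figures, using the counts from Out[5] as plain lists; the names `fig1`, `fig2`, `axes1`, and `axes2` are only for this illustration.

```python
import matplotlib.pyplot as plt

labels = ['group A', 'group B', 'group C', 'group D', 'group E']
male = [53, 86, 139, 133, 71]
female = [36, 104, 180, 129, 69]

# Method 1: sharey=True links the y-axis of every subplot in the figure.
fig1, axes1 = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
axes1[0].bar(labels, male, color='royalblue')
axes1[1].bar(labels, female, color='tomato')

# Method 2: independent axes, with the same limits set by hand.
fig2, axes2 = plt.subplots(1, 2, figsize=(12, 5))
axes2[0].bar(labels, male, color='royalblue')
axes2[1].bar(labels, female, color='tomato')
for ax in axes2:
    ax.set_ylim(0, 200)

plt.show()
```

With sharey the limits follow the data automatically, while set_ylim fixes them to a range you choose, which helps when several figures must use one common scale.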


Stacked Bar Plot

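To show the group counts for male and female students together in one bar plot, either stack the bars with the bottom parameter or overlay them and adjust their transparency with the alpha parameter.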

In [8]:

data.head(3)
data['race/ethnicity'].value_counts().sort_index()

Out[8]:

group A     89
group B    190
group C    319
group D    262
group E    140
Name: race/ethnicity, dtype: int64

Use the bottom parameter: it leaves the space below each bar empty, so the second set of bars starts where the first set ends.

In [9]:

group_cnt = data['race/ethnicity'].value_counts().sort_index()
fig, axes = plt.subplots(1,2, figsize= (12,7), sharey=True)
axes[0].bar(group_cnt.index,group_cnt,color='darkgray')
axes[1].bar(group.male.index, group.male, color='blue')
axes[1].bar(group['female'].index, group['female'], bottom=group['male'], color='red')

plt.show()
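
As a quick check of what bottom does, this sketch stacks the counts from Out[5] and reads back where each bar starts and ends. The variable names are chosen for this illustration only.

```python
import matplotlib.pyplot as plt

labels = ['group A', 'group B', 'group C', 'group D', 'group E']
male = [53, 86, 139, 133, 71]
female = [36, 104, 180, 129, 69]

fig, ax = plt.subplots(figsize=(8, 5))
male_bars = ax.bar(labels, male, color='blue', label='male')
# bottom=male lifts each female bar so it starts at the top of the male bar.
female_bars = ax.bar(labels, female, bottom=male, color='red', label='female')
ax.legend()

# Each stacked bar should end at the combined count for its group.
tops = [bar.get_y() + bar.get_height() for bar in female_bars]
print(tops)

plt.show()
```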

In [13]:

fig, ax = plt.subplots(1, 1, figsize=(12,7))

ax.barh(group.male.index, group.male.values / group_cnt.values)
ax.barh(group.female.index, group.female.values / group_cnt.values, left=group.male.values / group_cnt.values)

for s in ['top', 'bottom', 'left', 'right']:
    ax.spines[s].set_visible(False)

plt.show()
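
The cell above is a percentage stacked bar plot: each count is divided by the group total, and left plays the role that bottom plays for vertical bars. A minimal sketch of the same idea with the counts from Out[5] as NumPy arrays (the names `male_share` and `female_share` are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ['group A', 'group B', 'group C', 'group D', 'group E']
male = np.array([53, 86, 139, 133, 71])
female = np.array([36, 104, 180, 129, 69])
total = male + female

# Convert counts to shares so every row adds up to 1.
male_share = male / total
female_share = female / total

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(labels, male_share, label='male')
# left=male_share starts each female bar where the male bar ends.
ax.barh(labels, female_share, left=male_share, label='female')
ax.set_xlim(0, 1)
ax.legend()

plt.show()
```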

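Understanding ax.spines

Spines are the border lines of the axes, and they are what you adjust to customize those borders. ax.spines is a dict-like container keyed by top, bottom, left, and right. It is usually used together with tick_params.

Methods

  • set_visible(False)
  • set_position('center') or set_position(('data', 1))
  • set_linewidth(2)
  • set_alpha(0.5)  # transparency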
  • set_color('navy')
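
The methods listed above can be tried together in one small sketch. The plotted data is arbitrary and exists only to give the axes something to show.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

# Hide the top and right borders.
for side in ['top', 'right']:
    ax.spines[side].set_visible(False)

# Move the left spine to the middle of the axes and the bottom spine to y = 1.
ax.spines['left'].set_position('center')
ax.spines['bottom'].set_position(('data', 1))

# Restyle the remaining spines.
ax.spines['left'].set_linewidth(2)
ax.spines['bottom'].set_color('navy')
ax.spines['bottom'].set_alpha(0.5)

# tick_params adjusts the tick marks that sit on the spines.
ax.tick_params(axis='both', length=6, width=2)

plt.show()
```

Note that set_position takes either a keyword such as 'center' or a single 2-tuple of (position type, amount), which is why ('data', 1) is wrapped in its own parentheses.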

Grouped Bar Plot

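Workflow: shift the x positions -> set width -> set xticks and xticklabels

How to offset the x positions (in units of width)

  • 2 bars : -1/2, +1/2
  • 3 bars : -1, 0, +1 (-2/2, 0, +2/2)
  • 4 bars : -3/2, -1/2, +1/2, +3/2
  • General formula: position of the bar with index i (0-based) out of N bars per group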
    $x+\frac{-N+1+2\times i}{2}\times width$
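
The formula can be checked with a few lines of plain Python. The helper name `bar_offsets` is hypothetical and not part of Matplotlib.

```python
def bar_offsets(n_bars):
    """Offsets, in units of width, for n_bars bars centered on one tick."""
    return [(-n_bars + 1 + 2 * i) / 2 for i in range(n_bars)]


# Reproduce the offsets listed above for 2, 3, and 4 bars per group.
for n in (2, 3, 4):
    print(n, bar_offsets(n))

# Usage: bar i around tick x sits at x + bar_offsets(N)[i] * width.
width = 0.35
positions = [1 + offset * width for offset in bar_offsets(2)]
print(positions)
```

The offsets are symmetric around zero, so the group of bars as a whole stays centered on its tick.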

In [17]:

fig, ax = plt.subplots(1, 1, figsize=(12, 7))

idx = np.arange(len(group['male'].index))
width=0.35

ax.bar(idx-width/2, group['male'], 
       color='royalblue',
       width=width,
       label='Male')

ax.bar(idx+width/2, group['female'], 
       color='tomato',
       width=width,
       label='Female')

ax.set_xticks(idx)
ax.set_xticklabels(group['male'].index)
ax.legend()

plt.show()

Adding numeric text labels

In [19]:

fig, axes = plt.subplots(1, 2, figsize=(15, 7))

for ax in axes:
    ax.bar(group_cnt.index, group_cnt,
           width=0.7,
           edgecolor='black',
           linewidth=2,
           color='royalblue',
           zorder=10
          )

    ax.margins(0.1, 0.1)

    for s in ['top', 'right']:
        ax.spines[s].set_visible(False)

axes[1].grid(zorder=0)

for idx, value in zip(group_cnt.index, group_cnt):
    # Place each count 5 units above its bar, centered horizontally
    axes[1].text(idx, value+5, s=str(value),
                 ha='center', 
                 fontweight='bold'
                )

plt.show()
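
As an alternative to calling text in a loop, Matplotlib 3.4 and newer provide Axes.bar_label, which labels every bar in a container at once. This sketch assumes such a version is installed and uses the counts from Out[8] as a plain list.

```python
import matplotlib.pyplot as plt

labels = ['group A', 'group B', 'group C', 'group D', 'group E']
counts = [89, 190, 319, 262, 140]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(labels, counts, color='royalblue', edgecolor='black')

# bar_label places one text per bar; padding is the gap in points above it.
texts = ax.bar_label(bars, padding=3, fontweight='bold')
ax.margins(0.1, 0.1)

plt.show()
```

The manual loop gives full control over each label's position, while bar_label is shorter when the default placement at the bar edge is enough.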


Using error bars

First, a quick review of the pandas aggregation functions

  • mean(): Compute mean of groups
  • sum(): Compute sum of group values
  • size(): Compute group sizes
  • count(): Compute count of group
  • std(): Standard deviation of groups
  • var(): Compute variance of groups
  • sem(): Standard error of the mean of groups
  • describe(): Generates descriptive statistics
  • first(): Compute first of group values
  • last(): Compute last of group values
  • nth() : Take nth value, or a subset if n is a list
  • min(): Compute min of group values
  • max(): Compute max of group values

Using the yerr parameter

In [36]:

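# Despite the name, score_var holds the standard deviation (std), not the variance.
# Select the score columns explicitly so std() never sees the text columns.
score_var = data.groupby('gender')[['math score', 'reading score', 'writing score']].std().T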
score_var

Out[36]:

gender female male
math score 15.491453 14.356277
reading score 14.378245 13.931832
writing score 14.844842 14.113832

In [37]:

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

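# `score` is not defined in any cell shown here; assuming it holds the mean score per subject and gender
score = data.groupby('gender')[['math score', 'reading score', 'writing score']].mean().T

idx = np.arange(len(score_var.index))
width=0.3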


ax.bar(idx-width/2, score['male'], 
       color='royalblue',
       width=width,
       label='Male',
       yerr=score_var['male'],
       capsize=10
      )

ax.bar(idx+width/2, score['female'], 
       color='tomato',
       width=width,
       label='Female',
       yerr=score_var['female'],
       capsize=10
      )

ax.set_xticks(idx)
ax.set_xticklabels(score.index)
ax.set_ylim(0, 100)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.legend()
ax.set_title('Gender / Score', fontsize=20)
ax.set_xlabel('Subject', fontweight='bold')
ax.set_ylabel('Score', fontweight='bold')

plt.show()
