[Warehouse] Pandas Skills

A collection of frequently used pandas techniques.

Table of contents / importance

0. Library imports and DataFrame creation

1. Query ⭐⭐⭐

2. insert new column ⭐⭐

3. Cumsum ⭐⭐⭐

4. Sampling ⭐

5. Where ⭐⭐

6. isin ⭐⭐⭐

7. pct_change & rank ⭐

8. Melt & Merge & Join ⭐

9. nunique ⭐⭐

10. object type & memory usage ⭐

11. Replace ⭐⭐

12. Coloring ⭐

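13. Make Multiple columns ⭐⭐

14. Using tqdm with apply ⭐⭐

15. Using multiprocessing with apply ⭐

16. Suppressing warnings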

0. Library imports and DataFrame creation

First, import the required libraries and build a DataFrame.

import numpy as np
import pandas as pd

values_1 = np.random.randint(10, size=10)
values_2 = np.random.randint(10, size=10)
years = np.arange(2010,2020)
groups = ['A','A','B','A','B','B','C','A','C','C']
df = pd.DataFrame({'group':groups, 'year':years, 'value_1':values_1, 'value_2':values_2})
df

1. Query ⭐⭐⭐

Write the filter as a query string, much like querying a table in a database.

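# 1. Query
# Keep rows where value_1 is smaller than value_2
df.query('value_1 < value_2')
# Conditions can be combined with and/or, and comparisons can be chained.
# `transactions` stands for a separate DataFrame with `year` and `month` columns.
temp = transactions.query('year==2014 and 5< month <9')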
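As a minimal sketch of the two query forms above, the snippet below builds a small hypothetical `sales` table (its name, columns, and values are made up for illustration) and also shows how a local Python variable can be referenced inside the query string with the `@` prefix.

```python
import pandas as pd

# Hypothetical table used only for this illustration
sales = pd.DataFrame({
    'year':   [2013, 2014, 2014, 2014, 2015],
    'month':  [7, 5, 6, 8, 7],
    'amount': [10, 20, 30, 40, 50],
})

# Chained comparison: month strictly between 5 and 9, in 2014 only
summer_2014 = sales.query('year == 2014 and 5 < month < 9')

# Reference a local Python variable with the @ prefix
min_amount = 35
large = sales.query('amount >= @min_amount')

print(summer_2014)
print(large)
```

The `@` form is handy when the threshold comes from earlier code rather than being hard-coded in the string.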

2. insert new column ⭐⭐

#new column
new_col = np.random.randn(10)

#insert the new column at position 2
df.insert(2, 'new_col', new_col) 

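# Highlight the inserted column so it is easy to spot.
# The function is called once per cell and returns a CSS string.
def highlight_cols(s):
    color = 'yellow'
    return 'background-color: %s' % color

# pandas 2.1 renamed Styler.applymap to Styler.map; applymap is deprecated from that version on.
df.style.applymap(highlight_cols, subset=pd.IndexSlice[:, ['new_col']])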
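One thing worth knowing about `insert` is that it modifies the DataFrame in place and refuses to add a column name that already exists, so re-running the same notebook cell fails. A small sketch with a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical two-column frame used only for this illustration
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Position 1 places the new column between 'a' and 'b'
demo.insert(1, 'mid', [10, 20])
column_order = list(demo.columns)

# Inserting the same name again raises ValueError unless allow_duplicates=True
try:
    demo.insert(1, 'mid', [0, 0])
    second_insert_failed = False
except ValueError:
    second_insert_failed = True

print(column_order, second_insert_failed)
```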

3. Cumsum ⭐⭐⭐

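# Store the running (cumulative) sum of value_2 within each group in a new column
df['cumsum_2'] = df.groupby('group')['value_2'].cumsum()
df

# Per-group aggregates (mean and count). 'genreAlt' and '미국_평점_IMDB' are columns
# of a separate movie-ratings DataFrame, not of the df built in section 0.
df.groupby(['genreAlt']).agg({'미국_평점_IMDB':['mean','count']})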
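To make the difference between a group-wise cumulative sum and a group-wise aggregate concrete, here is a sketch on fixed numbers (the `demo` frame is hypothetical). `cumsum` returns one value per original row, while `agg` returns one row per group.

```python
import pandas as pd

# Hypothetical data with known values so the result can be checked by hand
demo = pd.DataFrame({
    'group':   ['A', 'A', 'B', 'A', 'B'],
    'value_2': [1, 2, 3, 4, 5],
})

# Running total within each group, aligned to the original rows
# A rows: 1, 1+2, 1+2+4   B rows: 3, 3+5
demo['cumsum_2'] = demo.groupby('group')['value_2'].cumsum()

# One row per group with several statistics
summary = demo.groupby('group').agg({'value_2': ['mean', 'count']})

print(demo)
print(summary)
```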

4. Sampling ⭐

# Sample 
sample1 = df.sample(n=3)
sample1

# Sample a given fraction of the rows
sample2 = df.sample(frac=0.5)
sample2
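Both calls above draw different rows on every run. If a reproducible sample is needed, `random_state` can be passed; the sketch below uses a hypothetical ten-row frame.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({'x': range(10)})

# The same seed yields the same rows
first = demo.sample(n=3, random_state=42)
second = demo.sample(n=3, random_state=42)

# frac=0.5 on ten rows returns five rows
half = demo.sample(frac=0.5, random_state=0)

print(first.equals(second), len(half))
```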

5. Where ⭐⭐

df['new_col'].where(df['new_col'] > 0 , -1)

# The NumPy version below gives the same values, but returns an ndarray instead of a Series.
# np.where(df['new_col'] > 0, df['new_col'], -1)
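A short sketch of the comparison, using a hypothetical Series with a labeled index: `Series.where` keeps values where the condition holds and substitutes the fallback elsewhere, preserving the index, while `np.where` produces a plain array.

```python
import numpy as np
import pandas as pd

# Hypothetical values used only for this illustration
s = pd.Series([-1.5, 0.3, 2.0, -0.2], index=list('abcd'))

# Keep positive values, replace the rest with -1 (index is preserved)
kept = s.where(s > 0, -1)

# Same values via NumPy, returned as an ndarray without the index
arr = np.where(s > 0, s, -1)

print(kept)
print(arr)
```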

6. isin ⭐⭐⭐

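# isin
# The list must hold the same type as the column: `year` is an integer column,
# so string values such as '2010' would match nothing.
years = [2010, 2014, 2017]
df[df.year.isin(years)]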
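Because `isin` compares values exactly, a type mismatch silently returns no rows instead of raising an error. The sketch below, on a hypothetical frame, shows the mismatch and the `~` operator for excluding values.

```python
import pandas as pd

# Hypothetical frame with an integer year column
demo = pd.DataFrame({'year': [2010, 2011, 2014, 2017]})

# Strings never equal integers, so nothing matches
matches_with_strings = int(demo['year'].isin(['2010', '2014']).sum())

# Integers match as expected
matches_with_ints = int(demo['year'].isin([2010, 2014]).sum())

# ~ negates the mask: keep every row whose year is NOT in the list
others = demo[~demo['year'].isin([2010, 2014])]

print(matches_with_strings, matches_with_ints)
print(others)
```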

7. pct_change & rank ⭐

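# Fractional change from the previous row (meaningful for ordered, sequential values)
df.value_1.pct_change()

# Rank the values (ties receive the average rank by default)
df['rank_1'] = df['value_1'].rank()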
df
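Since the DataFrame above holds random numbers, the following sketch uses fixed values (a hypothetical Series) so the output of `pct_change` and `rank` can be followed by hand, including how ties are ranked.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow
s = pd.Series([10, 20, 20, 5])

# (current - previous) / previous; the first row has no previous value and is NaN
change = s.pct_change()

# Default ranking is ascending and ties share the average rank
avg_rank = s.rank()

# Descending ranking where ties share the smallest rank
top_rank = s.rank(method='min', ascending=False)

print(change.tolist())
print(avg_rank.tolist())
print(top_rank.tolist())
```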

8. Melt & Merge & Join ⭐

# Melt
df_wide = pd.DataFrame({'city':['A','B','C'], 'day1':[22,25,28]
                        , 'day2':[12,15,18], 'day3':[42,35,28]
                        , 'day4':[24,20,23], 'day5':[21,23,25]})

df_wide.melt(id_vars=['city'])
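By default `melt` names the new columns `variable` and `value`. A sketch with a smaller hypothetical table shows how `var_name` and `value_name` give them meaningful names (the name `temperature` is an assumption about what the numbers represent), and how `pivot` reverses the operation.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per day
wide = pd.DataFrame({'city': ['A', 'B'], 'day1': [22, 25], 'day2': [12, 15]})

# Wide to long: one row per (city, day) pair
long_df = wide.melt(id_vars='city', var_name='day', value_name='temperature')

# Long back to wide
back = long_df.pivot(index='city', columns='day', values='temperature')

print(long_df)
print(back)
```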


# Explode
df1 = pd.DataFrame({'ID':['a','b','c'], 'measurement':[4,8,[2,3,8]], 
                    'day':[1,1,1]})
df1.explode('measurement').reset_index(drop=True)

# Join by index
# `data` stands for a separate DataFrame whose `df_index` column holds index labels of df.
df_made = pd.DataFrame(index=df.index).join(data.set_index('df_index'))
df_made
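The join-by-index pattern can be sketched with two small hypothetical frames. `join` performs a left join on the index by default, so rows without a match stay in the result with missing values.

```python
import pandas as pd

# Hypothetical frames: `right` stores the target index labels in a regular column
left = pd.DataFrame({'value': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'df_index': [0, 2], 'label': ['x', 'z']})

# Move the key column into the index, then join on it
joined = left.join(right.set_index('df_index'))

print(joined)
```

Index label 1 has no counterpart in `right`, so its `label` is missing after the join.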

9. nunique ⭐⭐

# Nunique

df.year.nunique()
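Two details of `nunique` that are easy to miss are that missing values are excluded unless `dropna=False` is passed, and that it also works per group. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with one missing year
demo = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'year':  [2010, 2010, 2011, None],
})

distinct = demo['year'].nunique()                     # NaN is ignored
distinct_with_nan = demo['year'].nunique(dropna=False)  # NaN counts as a value
per_group = demo.groupby('group')['year'].nunique()

print(distinct, distinct_with_nan)
print(per_group)
```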

10. object type & memory usage ⭐

# infer objects()
df.infer_objects().dtypes

# Select columns of a specific dtype
df.select_dtypes(include='int64')

# Memory usage
df_large = pd.DataFrame({'A': np.random.randn(1000000),
                    'B': np.random.randint(100, size=1000000)})
df_large.memory_usage()


# Total size in megabytes
df_large.memory_usage().sum() / (1024**2)
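For object (string) columns, the default `memory_usage()` counts only the pointers, not the strings themselves. The sketch below, on a hypothetical column forced to `object` dtype, compares the shallow and deep figures and shows the saving from converting a low-cardinality column to `category`.

```python
import pandas as pd

# Hypothetical column with only three distinct strings, stored as Python objects
df_obj = pd.DataFrame({
    'group': pd.Series(['A', 'B', 'C'] * 100_000, dtype=object)
})

# Shallow: size of the pointer array only
shallow = df_obj.memory_usage().sum()

# Deep: also counts the string objects being pointed to
deep = df_obj.memory_usage(deep=True).sum()

# Category stores small integer codes plus one copy of each distinct value
as_category = df_obj['group'].astype('category')
category_bytes = as_category.memory_usage(deep=True)

print(shallow, deep, category_bytes)
```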

11. Replace ⭐⭐

df.replace('A', 'A_1')
df.replace({'A':'A_1', 'B':'B_1'})
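Note that `replace` matches whole cell values across every column and returns a new DataFrame rather than modifying the original. The sketch below uses a hypothetical frame to show that behavior, a nested dict that limits the replacement to one column, and `str.replace` for substring replacement.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({'group': ['A', 'B', 'AB'], 'note': ['A', 'x', 'B']})

# Whole-value match in all columns; 'AB' is left alone
whole = demo.replace('A', 'A_1')

# Nested dict: replace only inside the 'group' column
scoped = demo.replace({'group': {'A': 'A_1'}})

# Substring replacement goes through the .str accessor instead
substring = demo['group'].str.replace('A', 'A_1', regex=False)

print(whole)
print(scoped)
print(substring.tolist())
```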

12. Coloring ⭐

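# Despite its name, this function colors values below 5 red and everything else black.
def color_negative_values(val):
    color = 'red' if val < 5 else 'black'
    return 'color: %s' % color

# Styler.applymap is deprecated since pandas 2.1 in favor of Styler.map.
df[['value_1', 'value_2']].style.applymap(color_negative_values)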

13. Make Multiple columns ⭐⭐

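def trim(x):
    # Pull three fields out of a stats string that mentions '전국' (nationwide).
    # Anything empty or malformed yields (None, None, None).
    empty = (None, None, None)
    try:
        if not x:
            return empty
        x = str(x)
        if '전국' not in x:
            return empty
        parts = x.split()
        return parts[1], parts[2], parts[4]
    except IndexError:
        # Fewer whitespace-separated tokens than expected
        return empty
    except Exception:
        # Show the offending value and carry on
        print(x)
        return empty

# apply returns one 3-tuple per row; zip(*...) regroups them into three columns.
# `df_kobis` stands for a separate DataFrame with a `stats` string column.
df['n_screen'], df['revenu_ko'], df['n_people'] = zip(*df_kobis.stats.apply(trim))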
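The `zip(*...)` unpacking can be sketched on a self-contained example. The `split_name` helper and the `people` frame are hypothetical; the second form builds the new columns through a DataFrame constructor, which may be easier to read when there are many output columns.

```python
import pandas as pd

def split_name(full):
    # Hypothetical helper: return (first, last) or (None, None) if the shape is unexpected
    parts = str(full).split()
    if len(parts) != 2:
        return None, None
    return parts[0], parts[1]

people = pd.DataFrame({'name': ['Ada Lovelace', 'Plato', 'Alan Turing']})

# One tuple per row, regrouped into one sequence per output column
people['first'], people['last'] = zip(*people['name'].apply(split_name))

# Alternative: expand the tuples into a DataFrame aligned on the same index
expanded = pd.DataFrame(
    people['name'].apply(split_name).tolist(),
    columns=['first', 'last'],
    index=people.index,
)

print(people)
print(expanded)
```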

14. Using tqdm with apply ⭐⭐

Use progress_apply() in place of the usual apply().

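from tqdm import tqdm
# Register progress_apply on pandas objects
tqdm.pandas(position=0, leave=True)

# trim_ko is a user-defined function applied to each value of the '리뷰_ko' column
df['리뷰_ko'] = df['리뷰_ko'].progress_apply(trim_ko)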

+ Using tqdm with a for loop

from tqdm import tqdm

for i in tqdm(range(30)):
    print(i)

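+ Workaround for the Colab print issue

# Try 1: use the notebook widget version
import tqdm.notebook as tq
for i in tq.tqdm(range(len(im))):  # `im` is whatever sequence is being processed
    ...  # loop body

# Try 2: let tqdm pick the right frontend automatically
from tqdm.auto import tqdm
tqdm.pandas()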

15. Using multiprocessing with apply ⭐

!pip install swifter

import swifter

# Same call as apply, routed through the swifter accessor; trim_ko is a user-defined function
df['스토리_ko_trim'] = df['스토리_ko'].swifter.apply(trim_ko)

16. Suppressing warnings

import warnings 
warnings.filterwarnings('ignore')
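The call above silences every warning for the rest of the session, which can hide real problems. As a hedged alternative, the sketch below limits the suppression to one block and one warning category using the standard library's `warnings.catch_warnings`. The `noisy` function is hypothetical and stands in for any call that emits a warning.

```python
import warnings

def noisy():
    # Hypothetical function that emits a deprecation warning
    warnings.warn('this API is deprecated', DeprecationWarning)
    return 42

# Suppress only DeprecationWarning, and only inside this block
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=DeprecationWarning)
    result = noisy()

# Outside the block the previous filters apply again; record=True captures warnings for inspection
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    noisy()

print(result, len(caught))
```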

Rudi
