Comparing Data Processing Speed in Python pandas

python 2020. 12. 19. 19:21

When working with pandas, you sometimes need to keep adding rows of data to a DataFrame one after another.

When adding a very large number of rows to a DataFrame,

which approach is the most efficient in terms of speed?

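(1) Create the DataFrame first, then insert each row with append.

(2) Create the DataFrame first, then insert each row with concat.

(3) Create the DataFrame first, then insert each row by loc indexing.

(4) Append each row as a dict to a Python list, then create the DataFrame once at the end.

(5) Append each row as a list to a Python list, then create the DataFrame once at the end.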

* Timing results (in seconds)

Method	1000 rows	5000 rows	10000 rows
(1),(2)	0.69	3.37	6.77
(3)	0.73	3.87	8.14
(4),(5)	0.011	0.046	0.088
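
To put the gap in the table into numbers, the ratios can be computed directly. The following is a minimal sketch that only reuses the timings reported above; the names `timings`, `baseline`, and `speedups` are made up for this illustration.

```python
# Timings copied from the table above (seconds).
timings = {
    "(1),(2)": {1000: 0.69, 5000: 3.37, 10000: 6.77},
    "(3)": {1000: 0.73, 5000: 3.87, 10000: 8.14},
    "(4),(5)": {1000: 0.011, 5000: 0.046, 10000: 0.088},
}

# Use the fastest group, (4),(5), as the baseline.
baseline = timings["(4),(5)"]

# For each slower method, compute how many times slower it is per row count.
speedups = {
    method: {n: t / baseline[n] for n, t in per_size.items()}
    for method, per_size in timings.items()
    if method != "(4),(5)"
}

for method, ratios in speedups.items():
    formatted = ", ".join(f"{n} rows: {r:.1f}x" for n, r in ratios.items())
    print(f"{method} vs (4),(5) -> {formatted}")
```

On these figures the row-by-row DataFrame methods come out roughly 60 to 90 times slower than building the DataFrame once, and the ratio grows with the number of rows.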

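Methods (1), (2), and (3) create the DataFrame first and then grow it with DataFrame row operations (append, concat, loc).

Methods (4) and (5) first accumulate the data in a list of dicts or a list of lists, and create the DataFrame only once at the end.

The conclusion: methods (4) and (5) are far faster than (1), (2), and (3), by a factor of roughly 60 to 90 in the table above. When processing a large amount of data row by row, avoid the DataFrame's own row-adding operations wherever possible.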
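
The same idea carries over when the data arrives as small DataFrames rather than single rows: collect the pieces in a plain list and call concat once. This is a minimal sketch under that assumption, not part of the experiment itself; `chunks` is a hypothetical stand-in for whatever produces the pieces.

```python
import numpy as np
import pandas as pd

columns = ["A", "B", "C", "D", "E"]

# Hypothetical input: four small DataFrames of three rows each.
chunks = [
    pd.DataFrame(np.random.randint(100, size=(3, 5)), columns=columns)
    for _ in range(4)
]

# Concatenate once at the end instead of calling concat inside the loop.
# ignore_index=True gives the result a fresh 0..n-1 index.
combined = pd.concat(chunks, ignore_index=True)
print(combined.shape)
```

Calling concat a single time avoids rebuilding the growing DataFrame on every iteration, which is what methods (1) and (2) do.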

The code used for the experiment is as follows.

import numpy as np
import pandas as pd
import time

numOfRows = 5000 #number of rows

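######### (1) DataFrame.append, one row per iteration
# NOTE: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
# so this block only runs on older pandas versions.
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(numOfRows-5):
    # append returns a new DataFrame each time, so the result must be reassigned
    df1 = df1.append(dict((a,np.random.randint(100)) for a in ['A','B','C','D','E']),
                     ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'
      .format(time.perf_counter() - startTime, numOfRows))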

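######### (2) pd.concat, one single-row DataFrame per iteration
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(numOfRows-5):
    # build a one-row DataFrame, then concatenate it onto the growing one
    df2 = pd.DataFrame(dict((a,[np.random.randint(100)]) for a in ['A','B','C','D','E']))
    # ignore_index=True keeps the row index unique (0..n-1), matching (1)
    df1 = pd.concat([df1, df2], axis=0, ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'
      .format(time.perf_counter() - startTime, numOfRows))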

######### (3) loc indexing, one row per iteration
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(5, numOfRows):
    # assigning to a label that does not exist yet enlarges the DataFrame by one row
    df1.loc[i] = np.random.randint(100, size=5)
print('Elapsed time: {:6.3f} seconds for {:d} rows'
      .format(time.perf_counter() - startTime, numOfRows))

######### (4) list of dicts, DataFrame created once at the end
startTime = time.perf_counter()
row_list = []
for i in range(numOfRows):
    # each row is a dict mapping column name -> value
    dict1 = dict((a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)
df1 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'
      .format(time.perf_counter() - startTime, numOfRows))
      
######### (5) list of lists, DataFrame created once at the end
startTime = time.perf_counter()
row_list = []
for i in range(numOfRows):
    # each row is a plain list of five values, in column order
    row1 = [np.random.randint(100) for _ in range(5)]
    row_list.append(row1)
df1 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'
      .format(time.perf_counter() - startTime, numOfRows))
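
To repeat the comparison at several row counts, the per-method blocks can be wrapped in functions. The following is a minimal sketch covering only methods (3) and (5), which do not depend on the removed append call; the helper names `build_with_loc`, `build_from_list`, and `time_builder` are made up for this illustration, and the row counts are kept small so it runs quickly.

```python
import time

import numpy as np
import pandas as pd

COLUMNS = ["A", "B", "C", "D", "E"]


def build_with_loc(num_rows):
    """Method (3): start with 5 rows, then enlarge via .loc (needs num_rows >= 5)."""
    df = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns=COLUMNS)
    for i in range(5, num_rows):
        df.loc[i] = np.random.randint(100, size=5)
    return df


def build_from_list(num_rows):
    """Method (5): collect rows in a list, create the DataFrame once."""
    rows = [np.random.randint(100, size=5).tolist() for _ in range(num_rows)]
    return pd.DataFrame(rows, columns=COLUMNS)


def time_builder(builder, num_rows):
    """Return (elapsed seconds, resulting DataFrame) for one builder call."""
    start = time.perf_counter()
    df = builder(num_rows)
    return time.perf_counter() - start, df


for n in (200, 400):
    for builder in (build_with_loc, build_from_list):
        elapsed, df = time_builder(builder, n)
        print(f"{builder.__name__:16s} {n:5d} rows: {elapsed:.4f} s, shape={df.shape}")
```

Absolute timings will differ by machine and pandas version, so only the relative gap between the two functions is meaningful.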

End.


ABOUT ME

도와리의 기술이야기
