-
[파이썬]Pandas groupby를 이용해 같은 항목을 묶어보자Pandas 2021. 2. 4. 04:23반응형
Untitled4 groupby¶
파이썬 데이터프레임에서 같은 셀을 가지는 항목끼리 묶어서 사용할 수 있는 groupby함수를 사용해보도록 하겠습니다
In [61]:import pandas as pd df = pd.read_csv('titanic.csv') df
Out[61]:PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S ... ... ... ... ... ... ... ... ... ... ... ... ... 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q 891 rows × 12 columns
편의를 위해 파일에서 이름, 성별, 형제수, 운임료만을 뽑아서 사용하겠습니다.
In [62]:df = df[['Name', 'Sex', 'SibSp', 'Fare']] df
Out[62]:Name Sex SibSp Fare 0 Braund, Mr. Owen Harris male 1 7.2500 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 71.2833 2 Heikkinen, Miss. Laina female 0 7.9250 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 53.1000 4 Allen, Mr. William Henry male 0 8.0500 ... ... ... ... ... 886 Montvila, Rev. Juozas male 0 13.0000 887 Graham, Miss. Margaret Edith female 0 30.0000 888 Johnston, Miss. Catherine Helen "Carrie" female 1 23.4500 889 Behr, Mr. Karl Howell male 0 30.0000 890 Dooley, Mr. Patrick male 0 7.7500 891 rows × 4 columns
성별을 남자, 여자로 나눠 각 항목에 해당하는 정보들을 출력해보도록 하겠습니다.
In [64]:groupby_sex = df.groupby('Sex') groupby_sex.groups
Out[64]:{'female': [1, 2, 3, 8, 9, 10, 11, 14, 15, 18, 19, 22, 24, 25, 28, 31, 32, 38, 39, 40, 41, 43, 44, 47, 49, 52, 53, 56, 58, 61, 66, 68, 71, 79, 82, 84, 85, 88, 98, 100, 106, 109, 111, 113, 114, 119, 123, 128, 132, 133, 136, 140, 141, 142, 147, 151, 156, 161, 166, 167, 172, 177, 180, 184, 186, 190, 192, 194, 195, 198, 199, 205, 208, 211, 215, 216, 218, 229, 230, 233, 235, 237, 240, 241, 246, 247, 251, 254, 255, 256, 257, 258, 259, 264, 268, 269, 272, 274, 275, 276, ...], 'male': [0, 4, 5, 6, 7, 12, 13, 16, 17, 20, 21, 23, 26, 27, 29, 30, 33, 34, 35, 36, 37, 42, 45, 46, 48, 50, 51, 54, 55, 57, 59, 60, 62, 63, 64, 65, 67, 69, 70, 72, 73, 74, 75, 76, 77, 78, 80, 81, 83, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 101, 102, 103, 104, 105, 107, 108, 110, 112, 115, 116, 117, 118, 120, 121, 122, 124, 125, 126, 127, 129, 130, 131, 134, 135, 137, 138, 139, 143, 144, 145, 146, 148, 149, 150, 152, 153, 154, 155, ...]}
먼저 성별을 기준으로 나눠주었구요
In [66]:for name, group in groupby_sex: print(name + ' : ' + str(len(group))) print(group) print()
female : 314 Name Sex SibSp Fare 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 71.2833 2 Heikkinen, Miss. Laina female 0 7.9250 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 53.1000 8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 0 11.1333 9 Nasser, Mrs. Nicholas (Adele Achem) female 1 30.0708 .. ... ... ... ... 880 Shelley, Mrs. William (Imanita Parrish Hall) female 0 26.0000 882 Dahlberg, Miss. Gerda Ulrika female 0 10.5167 885 Rice, Mrs. William (Margaret Norton) female 0 29.1250 887 Graham, Miss. Margaret Edith female 0 30.0000 888 Johnston, Miss. Catherine Helen "Carrie" female 1 23.4500 [314 rows x 4 columns] male : 577 Name Sex SibSp Fare 0 Braund, Mr. Owen Harris male 1 7.2500 4 Allen, Mr. William Henry male 0 8.0500 5 Moran, Mr. James male 0 8.4583 6 McCarthy, Mr. Timothy J male 0 51.8625 7 Palsson, Master. Gosta Leonard male 3 21.0750 .. ... ... ... ... 883 Banfield, Mr. Frederick James male 0 10.5000 884 Sutehall, Mr. Henry Jr male 0 7.0500 886 Montvila, Rev. Juozas male 0 13.0000 889 Behr, Mr. Karl Howell male 0 30.0000 890 Dooley, Mr. Patrick male 0 7.7500 [577 rows x 4 columns]
groupby를 사용하면 여러 통계정보도 알 수 있는데요 대표적으로 평균, 표준편차등을 알 수 있습니다.
In [67]:df.groupby('Sex').mean()
Out[67]:SibSp Fare Sex female 0.694268 44.479818 male 0.429809 25.523893 In [68]:df.groupby(['Sex', 'SibSp']).mean()
Out[68]:Fare Sex SibSp female 0 38.759867 1 49.401219 2 63.750315 3 67.018191 4 27.402083 5 46.900000 8 69.550000 male 0 20.452848 1 38.740497 2 41.356667 3 73.068340 4 34.082292 5 46.900000 8 69.550000 In [69]:df.groupby('Sex').size()
Out[69]:Sex female 314 male 577 dtype: int64
또한 agg(aggregation)함수를 이용해 좀더 활용도 높게 사용할 수 있습니다.
직접 함수를 만들어 대입해줄수도 있습니다.In [73]:df.groupby('Sex').agg(np.mean)
Out[73]:SibSp Fare Sex female 0.694268 44.479818 male 0.429809 25.523893 In [74]:def my_mean(s): return np.mean(s)+1 df.groupby('Sex').agg({'SibSp': my_mean, 'Fare': my_mean})
Out[74]:SibSp Fare Sex female 1.694268 45.479818 male 1.429809 26.523893 반응형'Pandas' 카테고리의 다른 글
[파이썬]Pandas 행,열 생성 및 수정하기 (0) 2021.02.04 [파이썬]Pandas 행과 열 선택하기 (0) 2021.02.04 [파이썬]Pandas DataFrame만들기 (0) 2021.02.04 [파이썬] DataFrame에서 누락값 처리하기 (0) 2021.01.15 [파이썬] 판다스 DataFrame에서 열 선택하기 (0) 2021.01.08