R - 3. dplyr functions: sorting rows (arrange) and selecting the top rows (top_n)
윤채니챈
2023. 6. 23. 00:09
• Data used
https://www.kaggle.com/datasets/mohithsairamreddy/salary-data
Salary_Data: salary data based on experience, age, gender, job title, and education level (www.kaggle.com)
- Loading the data
library(dplyr)  # provides %>%, group_by(), arrange(), and top_n() used below
dir = '/Users/yunchaewon/Desktop/r-data analysis/Rbasic/'
df = read.csv(paste0(dir,'Salary_Data.csv'))
str(df)
- Data description
str(df)
'data.frame': 6704 obs. of 6 variables:
$ Age : int 32 28 45 36 52 29 42 31 26 38 ...
$ Gender : chr "Male" "Female" "Male" "Female" ...
$ Education.Level : chr "Bachelor's" "Master's" "PhD" "Bachelor's" ...
$ Job.Title : chr "Software Engineer" "Data Analyst" "Senior Manager" "Sales Associate" ...
$ Years.of.Experience: num 5 3 15 7 20 2 12 4 1 10 ...
$ Salary : int 90000 65000 150000 60000 200000 55000 120000 80000 45000 110000 ...
- Age: integer variable
- Gender: character variable
- Education.Level: character variable. "Bachelor's" is a bachelor's degree, "Master's" is a master's degree, and "PhD" is a doctorate
- Job.Title: character variable giving the person's job title, for example "Software Engineer" or "Data Analyst"
- Years.of.Experience: numeric variable giving the person's years of experience; it may contain decimal values
- Salary: integer variable giving the person's annual salary
arrange()
- Sorts the rows of a data frame
• Ascending order
df1 = df%>%
arrange(Age)
df1
Result: the rows are sorted by Age in ascending order
Age Gender Education.Level Job.Title Years.of.Experience Salary
1 21 Female High School Junior Sales Representative 0.0 25000
2 21 Female High School Junior Sales Representative 0.0 25000
3 21 Female High School Junior Sales Representative 0.0 25000
4 21 Female High School Junior Sales Representative 0.0 25000
5 21 Female High School Junior Sales Representative 0.0 25000
6 21 Female High School Junior Sales Representative 0.0 25000
7 21 Female High School Junior Sales Representative 0.0 25000
8 21 Female High School Junior Sales Representative 0.0 25000
9 21 Female High School Junior Sales Representative 0.0 25000
10 21 Female High School Junior Sales Representative 0.0 25000
11 21 Female High School Junior Sales Representative 0.0 25000
12 21 Female High School Junior Sales Representative 0.0 25000
13 21 Female High School Junior Sales Representative 0.0 25000
14 21 Female High School Junior Sales Representative 0.0 25000
15 21 Female High School Junior Sales Representative 0.0 25000
16 21 Female High School Junior Sales Representative 0.0 25000
17 21 Female High School Junior Sales Representative 0.0 25000
18 21 Female High School Junior Sales Representative 0.0 25000
19 22 Male Bachelor's Degree Front End Developer 1.0 50000
20 22 Female High School Back end Developer 0.0 51832
21 22 Female High School Back end Developer 0.0 51832
22 22 Male Bachelor's Degree Software Engineer 1.0 50000
• Descending order: arrange(-variable)
df2 = df%>%
arrange(-Age)
df2
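The unary minus only works on numeric columns. As a minimal sketch, assuming a small made-up data frame (`toy` is hypothetical and not part of Salary_Data.csv), dplyr's `desc()` gives the same descending order and also works on character columns:

```r
library(dplyr)

# Hypothetical toy data, not taken from Salary_Data.csv
toy <- data.frame(
  Age    = c(32, 28, 45, 28),
  Gender = c("Male", "Female", "Male", "Other")
)

# Unary minus: fine for numeric columns
by_minus <- toy %>% arrange(-Age)

# desc() produces the same descending order for numeric columns
by_desc <- toy %>% arrange(desc(Age))

# desc() also works on character columns, where -Gender would raise an error;
# the second key (Age) breaks ties in ascending order
by_gender <- toy %>% arrange(desc(Gender), Age)

print(by_minus)
print(by_gender)
```

Passing several columns to `arrange()` sorts by the first and uses the later ones only to break ties.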
• arrange after group_by
df1 = df%>%
group_by(Age,Gender)%>%
summarise(count = n())%>%
arrange(count)
df1
Result
# A tibble: 85 × 3
# Groups: Age [42]
Age Gender count
<int> <chr> <int>
1 58 "Female" 1
2 60 "Male" 1
3 23 "Other" 2
4 25 "Other" 2
5 31 "Other" 2
6 37 "Other" 2
7 53 "Other" 2
8 56 "Female" 2
9 61 "Male" 2
10 NA "" 2
# ℹ 75 more rows
# ℹ Use `print(n = ...)` to see more rows
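The "# Groups: Age [42]" header appears because summarise() drops only the last grouping variable (Gender), so the result is still grouped by Age. The sketch below, on a made-up data frame (`toy` is hypothetical), shows one way to get an ungrouped result with `.groups = "drop"`, plus the `count()` shorthand for the same tally:

```r
library(dplyr)

# Hypothetical toy data, not taken from Salary_Data.csv
toy <- data.frame(
  Age    = c(30, 30, 30, 41, 41, 25),
  Gender = c("Male", "Male", "Female", "Female", "Female", "Other")
)

# .groups = "drop" returns an ungrouped tibble after summarise()
counts <- toy %>%
  group_by(Age, Gender) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count), Age)

# count() is a shorthand for group_by() + summarise(n())
counts2 <- toy %>%
  count(Age, Gender, name = "count") %>%
  arrange(desc(count), Age)

print(counts)
```

By default arrange() ignores grouping when sorting; passing `.by_group = TRUE` sorts by the grouping variables first.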
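top_n()
- Selects the top N observations of a data frame, ranked by a given variable
Usage
top_n(x, n, wt)
- x: the data frame
- n: the number of top observations to select (per group if x is grouped); a negative n selects the bottom rows instead
- wt: the variable to rank by (optional; defaults to the last column of the data frame)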
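To make the arguments concrete, here is a small sketch on a made-up data frame (`toy` is hypothetical). It also shows `slice_max()`, which the dplyr documentation recommends over top_n() since top_n() was marked as superseded:

```r
library(dplyr)

# Hypothetical toy data, not taken from Salary_Data.csv
toy <- data.frame(
  Name   = c("a", "b", "c", "d", "e"),
  Salary = c(100, 300, 200, 250, 50)
)

# Positive n: the 2 rows with the highest Salary (input order is kept)
top2 <- toy %>% top_n(n = 2, wt = Salary)

# Negative n: the 2 rows with the lowest Salary
bottom2 <- toy %>% top_n(n = -2, wt = Salary)

# slice_max() selects the same rows and returns them sorted in descending order
top2_new <- toy %>% slice_max(Salary, n = 2)

print(top2)
print(bottom2)
print(top2_new)
```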
1. top_n after arrange
df3 = df%>%
arrange(-Salary)%>%
top_n(n=10,wt=Salary)
df3
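Result: the rows with the 10 highest Salary values, in descending order because of the preceding arrange. Rows tied at the cutoff are all kept, so more than 10 rows are returned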
Age Gender Education.Level Job.Title Years.of.Experience Salary
1 50 Male Bachelor's CEO 25 250000
2 52 Male PhD Chief Technology Officer 24 250000
3 45 Male Bachelor's Degree Financial Manager 21 250000
4 51 Male PhD Data Scientist 24 240000
5 51 Male PhD Data Scientist 24 240000
6 51 Male PhD Data Scientist 24 240000
7 51 Male PhD Data Scientist 24 240000
8 51 Male PhD Data Scientist 24 240000
9 51 Male PhD Data Scientist 24 240000
10 51 Male PhD Data Scientist 24 240000
11 51 Male PhD Data Scientist 24 240000
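Two details of this result are easy to miss: top_n() itself does not sort, and ties can return more than n rows. A minimal sketch on a made-up data frame (`toy` is hypothetical) illustrates both, and shows `head()` as one way to get exactly n rows:

```r
library(dplyr)

# Hypothetical toy data, not taken from Salary_Data.csv
toy <- data.frame(
  Name   = c("a", "b", "c", "d", "e"),
  Salary = c(200, 300, 200, 100, 200)
)

# n = 2, but three rows tie at 200, so 4 rows come back, in input order
unsorted <- toy %>% top_n(n = 2, wt = Salary)

# Sorting first gives the same 4 rows in descending order
sorted <- toy %>% arrange(-Salary) %>% top_n(n = 2, wt = Salary)

# head() after arrange() returns exactly 2 rows, cutting through the tie
exact <- toy %>% arrange(-Salary) %>% head(2)

print(unsorted)
print(sorted)
print(exact)
```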
2. top_n after group_by
df5 = df%>%
group_by(Gender)%>%
top_n(n =5,wt=Age)
df5
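Result: top_n is applied separately to each group created by group_by
In df5, top_n (ranked by Age) is applied within each Gender group (Other, Male, Female)
*** Note: rows tied at the cutoff are all returned. For example, with n = 3, if more than three people share the same qualifying value, every one of those observations is printed
# A tibble: 16 × 6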
# Groups: Gender [3]
Age Gender Education.Level Job.Title Years.of.Experience Salary
<int> <chr> <chr> <chr> <dbl> <int>
1 62 Male PhD Software Engineer Manager 19 200000
2 62 Male PhD Software Engineer Manager 20 200000
3 62 Male PhD Software Engineer Manager 19 200000
4 62 Male PhD Software Engineer Manager 20 200000
5 62 Male PhD Software Engineer Manager 19 200000
6 58 Female PhD Software Engineer Manager 17 195000
7 53 Other High School Senior Project Engineer 31 166109
8 60 Female PhD Software Engineer Manager 33 179180
9 60 Female PhD Software Engineer Manager 34 188651
10 53 Other High School Senior Project Engineer 31 166109
11 60 Female PhD Software Engineer Manager 33 179180
12 60 Female PhD Software Engineer Manager 34 188651
13 54 Other High School Senior Software Engineer 29 158254
14 54 Other High School Senior Software Engineer 29 158966
15 54 Other High School Senior Software Engineer 29 158966
16 54 Other High School Senior Software Engineer 29 158966
-> The result above is still grouped, so remove the grouping with ungroup()
3. ungroup the result of top_n after group_by
df5 = df%>%
group_by(Gender)%>%
top_n(n =5,wt=Age)%>%
ungroup()
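Why ungroup() matters becomes visible in later steps: functions such as summarise() keep working per group until the grouping is removed. A minimal sketch, assuming a made-up data frame (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, not taken from Salary_Data.csv
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female", "Other"),
  Age    = c(62, 60, 45, 58, 33, 53)
)

# Top 2 ages within each Gender; the result is still grouped
grouped <- toy %>%
  group_by(Gender) %>%
  top_n(n = 2, wt = Age)

# Same rows, grouping removed
flat <- grouped %>% ungroup()

# On the grouped result, summarise() returns one row per Gender
per_group <- grouped %>% summarise(mean_age = mean(Age))

# After ungroup(), summarise() returns a single overall row
overall <- flat %>% summarise(mean_age = mean(Age))

print(per_group)
print(overall)
```

A group with fewer than n rows (here "Other") simply contributes all the rows it has.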