R - 3. dplyr functions: sorting (arrange) and selecting top rows (top_n)

윤채니챈 2023. 6. 23. 00:09


• Dataset used

https://www.kaggle.com/datasets/mohithsairamreddy/salary-data

Salary_Data: salary data based on experience, age, gender, job title, and education level (www.kaggle.com)

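- Loading the data

library(dplyr)  # provides %>%, arrange(), group_by(), summarise(), top_n()

# Directory that holds the downloaded CSV (the author's local path)
dir = '/Users/yunchaewon/Desktop/r-data analysis/Rbasic/'

df = read.csv(paste0(dir,'Salary_Data.csv'))
str(df)  # inspect column names and types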

- Data description

str(df)
'data.frame':	6704 obs. of  6 variables:
 $ Age                : int  32 28 45 36 52 29 42 31 26 38 ...
 $ Gender             : chr  "Male" "Female" "Male" "Female" ...
 $ Education.Level    : chr  "Bachelor's" "Master's" "PhD" "Bachelor's" ...
 $ Job.Title          : chr  "Software Engineer" "Data Analyst" "Senior Manager" "Sales Associate" ...
 $ Years.of.Experience: num  5 3 15 7 20 2 12 4 1 10 ...
 $ Salary             : int  90000 65000 150000 60000 200000 55000 120000 80000 45000 110000 ...

Age: integer variable, the person's age
Gender: character variable
Education.Level: character variable; "Bachelor's" is a bachelor's degree, "Master's" is a master's degree, and "PhD" is a doctorate
Job.Title: character variable, the person's job title; for example "Software Engineer" or "Data Analyst"
Years.of.Experience: numeric variable, the person's years of experience; it can contain fractional values
Salary: integer variable, the person's annual salary

arrange()

- Sorts the rows of a data frame

• Ascending order

df1 = df%>%
  arrange(Age)

df1

Result: the rows are sorted by Age in ascending order

  Age Gender   Education.Level                   Job.Title Years.of.Experience Salary
1    21 Female       High School Junior Sales Representative                 0.0  25000
2    21 Female       High School Junior Sales Representative                 0.0  25000
3    21 Female       High School Junior Sales Representative                 0.0  25000
4    21 Female       High School Junior Sales Representative                 0.0  25000
5    21 Female       High School Junior Sales Representative                 0.0  25000
6    21 Female       High School Junior Sales Representative                 0.0  25000
7    21 Female       High School Junior Sales Representative                 0.0  25000
8    21 Female       High School Junior Sales Representative                 0.0  25000
9    21 Female       High School Junior Sales Representative                 0.0  25000
10   21 Female       High School Junior Sales Representative                 0.0  25000
11   21 Female       High School Junior Sales Representative                 0.0  25000
12   21 Female       High School Junior Sales Representative                 0.0  25000
13   21 Female       High School Junior Sales Representative                 0.0  25000
14   21 Female       High School Junior Sales Representative                 0.0  25000
15   21 Female       High School Junior Sales Representative                 0.0  25000
16   21 Female       High School Junior Sales Representative                 0.0  25000
17   21 Female       High School Junior Sales Representative                 0.0  25000
18   21 Female       High School Junior Sales Representative                 0.0  25000
19   22   Male Bachelor's Degree         Front End Developer                 1.0  50000
20   22 Female       High School          Back end Developer                 0.0  51832
21   22 Female       High School          Back end Developer                 0.0  51832
22   22   Male Bachelor's Degree           Software Engineer                 1.0  50000

• Descending order: arrange(-variable)

df2 = df%>%
  arrange(-Age)
df2
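The minus sign only works for numeric columns. As a minimal sketch, assuming a small made-up data frame (`toy` is not the Kaggle data), `desc()` gives the same descending sort and also works on character columns, and several sort keys can be combined in one `arrange()` call:

```r
library(dplyr)

# Toy data, made up for illustration (not the Kaggle file)
toy <- data.frame(
  Age    = c(30, 25, 30, 41),
  Gender = c("Male", "Female", "Female", "Male"),
  Salary = c(70000, 50000, 90000, 120000)
)

# desc() sorts in descending order; equivalent to arrange(-Age) for numbers
by_age_desc <- toy %>%
  arrange(desc(Age))

# Several keys: Age ascending, ties in Age broken by Salary descending
by_age_salary <- toy %>%
  arrange(Age, desc(Salary))

by_age_desc$Age        # 41 30 30 25
by_age_salary$Salary   # 50000 90000 70000 120000
```
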

• arrange after group_by

df1 = df%>%
  group_by(Age,Gender)%>%
  summarise(count = n())%>%
  arrange(count)
df1

Result

# A tibble: 85 × 3
# Groups:   Age [42]
     Age Gender   count
   <int> <chr>    <int>
 1    58 "Female"     1
 2    60 "Male"       1
 3    23 "Other"      2
 4    25 "Other"      2
 5    31 "Other"      2
 6    37 "Other"      2
 7    53 "Other"      2
 8    56 "Female"     2
 9    61 "Male"       2
10    NA ""           2
# ℹ 75 more rows
# ℹ Use `print(n = ...)` to see more rows
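The `# Groups: Age [42]` line in the output shows that the result is still grouped by Age, because `summarise()` by default removes only the last grouping variable. A minimal sketch on made-up data (`toy` is hypothetical) of how the grouping could be dropped with `.groups = "drop"`, and how `count()` produces the same counts in one step:

```r
library(dplyr)

# Toy data, made up for illustration (not the Kaggle file)
toy <- data.frame(
  Age    = c(25, 25, 25, 30, 30, 41),
  Gender = c("Female", "Female", "Male", "Male", "Male", "Male")
)

# .groups = "drop" returns an ungrouped result after summarising
counts <- toy %>%
  group_by(Age, Gender) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(count, Age, Gender)   # extra keys make the order of ties explicit

# count() is a shorthand for group_by() + summarise(n()) on the same columns
counts2 <- toy %>%
  count(Age, Gender, name = "count") %>%
  arrange(count, Age, Gender)

is_grouped_df(counts)   # FALSE
counts$count            # 1 1 2 2
```
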

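top_n()

- Selects the top N observations of a data frame, ranked by a given variable. In dplyr 1.0.0 and later, top_n() is superseded by slice_max() and slice_min(), but it still works.

Usage

top_n(x, n, wt)

x: the data frame
n: the number of top observations to select; a negative n selects the bottom observations instead
wt: the variable used for ranking; if omitted, the last column of the data frame is used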
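As a small sketch of the call itself, assuming a made-up data frame (`toy` is hypothetical): a positive `n` keeps the rows with the largest values of `wt`, a negative `n` keeps the rows with the smallest values, and in both cases the rows stay in their original order rather than being sorted.

```r
library(dplyr)

# Toy data, made up for illustration (not the Kaggle file)
toy <- data.frame(
  Name   = c("a", "b", "c", "d"),
  Salary = c(70000, 50000, 90000, 120000)
)

# Positive n: the 2 rows with the highest Salary
top2 <- top_n(toy, 2, Salary)

# Negative n: the 2 rows with the lowest Salary
bottom2 <- top_n(toy, -2, Salary)

top2$Salary      # 90000 120000 (original row order, not sorted)
bottom2$Salary   # 70000 50000
```
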

1. Applying top_n after arrange

df3 = df%>%
  arrange(-Salary)%>%
  top_n(n=10,wt=Salary)
df3

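Result: the top rows by Salary, sorted in descending order. Although n = 10 was requested, more than 10 rows are returned (11 here), because top_n keeps every row that ties with the 10th-highest value.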

 Age Gender   Education.Level                Job.Title Years.of.Experience Salary
1   50   Male        Bachelor's                      CEO                  25 250000
2   52   Male               PhD Chief Technology Officer                  24 250000
3   45   Male Bachelor's Degree        Financial Manager                  21 250000
4   51   Male               PhD           Data Scientist                  24 240000
5   51   Male               PhD           Data Scientist                  24 240000
6   51   Male               PhD           Data Scientist                  24 240000
7   51   Male               PhD           Data Scientist                  24 240000
8   51   Male               PhD           Data Scientist                  24 240000
9   51   Male               PhD           Data Scientist                  24 240000
10  51   Male               PhD           Data Scientist                  24 240000
11  51   Male               PhD           Data Scientist                  24 240000
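If exactly n rows are wanted, one option is `slice_max()` with `with_ties = FALSE`. The sketch below uses a made-up data frame (`toy` is hypothetical) to contrast it with `top_n()`; which of the tied rows survives is decided by row order, so this is only suitable when that choice does not matter.

```r
library(dplyr)

# Toy data with a tie at the second-highest Salary (made up for illustration)
toy <- data.frame(
  Name   = c("a", "b", "c", "d"),
  Salary = c(250, 240, 240, 100)
)

# top_n keeps ties, so n = 2 returns 3 rows here
with_ties <- toy %>%
  top_n(n = 2, wt = Salary)

# slice_max with with_ties = FALSE returns exactly 2 rows, sorted descending
exact <- toy %>%
  slice_max(Salary, n = 2, with_ties = FALSE)

nrow(with_ties)   # 3
nrow(exact)       # 2
```
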

2. Applying top_n after group_by

df5 = df%>%
  group_by(Gender)%>%
  top_n(n =5,wt=Age)
df5

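Result: top_n is applied separately to each group created by group_by

For df5, top_n (ranked by Age) is applied within each Gender group (Other, Male, Female)

***top_n keeps ties: if, for example, n = 3 and more than three rows share the cutoff value, all of the tied rows are returned, so a group can contribute more than n rows

# A tibble: 16 × 6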
# Groups:   Gender [3]
     Age Gender Education.Level Job.Title                 Years.of.Experience Salary
   <int> <chr>  <chr>           <chr>                                   <dbl>  <int>
 1    62 Male   PhD             Software Engineer Manager                  19 200000
 2    62 Male   PhD             Software Engineer Manager                  20 200000
 3    62 Male   PhD             Software Engineer Manager                  19 200000
 4    62 Male   PhD             Software Engineer Manager                  20 200000
 5    62 Male   PhD             Software Engineer Manager                  19 200000
 6    58 Female PhD             Software Engineer Manager                  17 195000
 7    53 Other  High School     Senior Project Engineer                    31 166109
 8    60 Female PhD             Software Engineer Manager                  33 179180
 9    60 Female PhD             Software Engineer Manager                  34 188651
10    53 Other  High School     Senior Project Engineer                    31 166109
11    60 Female PhD             Software Engineer Manager                  33 179180
12    60 Female PhD             Software Engineer Manager                  34 188651
13    54 Other  High School     Senior Software Engineer                   29 158254
14    54 Other  High School     Senior Software Engineer                   29 158966
15    54 Other  High School     Senior Software Engineer                   29 158966
16    54 Other  High School     Senior Software Engineer                   29 158966

-> The result is still grouped by Gender, so remove the grouping with ungroup()

3. Ungrouping the result of top_n after group_by

df5 = df%>%
  group_by(Gender)%>%
  top_n(n =5,wt=Age)%>%
  ungroup()
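To see why the ungrouping matters, here is a minimal sketch on made-up data (`toy` is hypothetical): later verbs such as `summarise()` run once per group on a grouped result, but once over the whole table after `ungroup()`.

```r
library(dplyr)

# Toy data, made up for illustration (not the Kaggle file)
toy <- data.frame(
  Age    = c(62, 58, 60, 45),
  Gender = c("Male", "Male", "Female", "Female")
)

top_grouped <- toy %>%
  group_by(Gender) %>%
  top_n(n = 1, wt = Age)      # oldest person per Gender, still grouped

top_flat <- top_grouped %>%
  ungroup()                   # same rows, grouping removed

group_vars(top_grouped)   # "Gender"
group_vars(top_flat)      # character(0)

# Grouped: one mean per Gender; ungrouped: a single overall mean
per_group <- top_grouped %>% summarise(mean_age = mean(Age))
overall   <- top_flat %>% summarise(mean_age = mean(Age))

nrow(per_group)    # 2
overall$mean_age   # 61
```
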
