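[R Preprocessing Course 22] {dplyr} arrange, distinct, across: Other Handy Functions [tidyverse]

Hello, this is shun (@datasciencemore)!

This time I'll introduce some functions that come up less often than the ones covered so far, but are handy to know.
Knowing them will give your preprocessing skills another boost.

As our data, we'll use everyone's favorite, diamonds.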

> diamonds
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

1. Sorting: arrange
2. Removing duplicates: distinct
3. Column-wise operations: across
Summary

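1. Sorting: arrange

arrange sorts the rows by the specified column.
The default is ascending order, but wrapping the column in desc sorts in descending order instead.

What do ascending and descending order mean?

Ascending order: values are arranged from smallest to largest.

Descending order: values are arranged from largest to smallest.

Let's look at some concrete examples.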

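1.1. Sorting in ascending order

# Load dplyr and the diamonds data (diamonds ships with ggplot2)
library(tidyverse)

# Sort by the depth column in ascending order
diamonds %>%
  arrange(depth)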

# A tibble: 53,940 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1    Fair  G     SI1      43      59  3634  6.32  6.27  3.97
 2  1.09 Ideal J     VS2      43      54  4778  6.53  6.55  4.12
 3  1    Fair  G     VS2      44      53  4032  6.31  6.24  4.12
 4  1.43 Fair  I     VS1      50.8    60  6727  7.73  7.25  3.93
 5  0.3  Fair  E     VVS2     51      67   945  4.67  4.62  2.37
 6  0.7  Fair  D     SI1      52.2    65  1895  6.04  5.99  3.14
 7  0.37 Fair  F     IF       52.3    61  1166  4.96  4.91  2.58
 8  0.56 Fair  H     VS2      52.7    70  1293  5.71  5.57  2.97
 9  1.02 Fair  I     SI1      53      63  2856  6.84  6.77  3.66
10  0.96 Fair  E     SI2      53.1    63  2815  6.73  6.65  3.55
# … with 53,930 more rows

The rows are now sorted from top to bottom in ascending order of depth.

1.2. Sorting in descending order

# Sort by the depth column in descending order
diamonds %>%
  arrange(desc(depth))

# A tibble: 53,940 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.5  Fair  E     VS2      79      73  2579  5.21  5.18  4.09
 2  0.5  Fair  E     VS2      79      73  2579  5.21  5.18  4.09
 3  1.03 Fair  E     I1       78.2    54  1262  5.72  5.59  4.42
 4  0.99 Fair  J     I1       73.6    60  1789  6.01  5.8   4.35
 5  0.9  Fair  G     SI1      72.9    54  2691  5.74  5.67  4.16
 6  0.96 Fair  G     SI2      72.2    56  2438  6.01  5.81  4.28
 7  1.02 Fair  H     VS1      71.8    56  4455  6.04  5.97  4.31
 8  0.99 Fair  H     VS2      71.6    57  3593  5.94  5.8   4.2 
 9  0.7  Fair  D     SI2      71.6    55  1696  5.47  5.28  3.85
10  1.5  Fair  I     I1       71.3    58  4368  6.85  6.81  4.87
# … with 53,930 more rows

Wrapping the column name in desc() sorts the rows in descending order.
Looking at the depth column, the values do indeed run from largest to smallest!
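
One detail worth knowing when switching between the two directions is how missing values behave. The sketch below uses a small made-up tibble (`toy` and its values are assumptions for illustration, not part of diamonds) and relies on the documented behavior of arrange that NA is placed last regardless of direction.

```r
library(dplyr)

# Hypothetical toy data with one missing depth value
toy <- tibble(
  id    = c("a", "b", "c", "d"),
  depth = c(61.5, NA, 43, 79)
)

# Ascending: smallest depth first, NA last
asc <- toy %>% arrange(depth)

# Descending: largest depth first, NA still last
dsc <- toy %>% arrange(desc(depth))

asc$id
dsc$id
```

In both results the row with the missing depth ends up at the bottom, so desc() reverses the order of the known values only.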

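1.3. Sorting by multiple columns

You can also sort by multiple columns. Handy!
Rows are sorted by the first column, and ties are broken by the next one.

# Sort by multiple columns (depth, then table)
diamonds %>%
  arrange(depth, table)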

# A tibble: 53,940 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.09 Ideal J     VS2      43      54  4778  6.53  6.55  4.12
 2  1    Fair  G     SI1      43      59  3634  6.32  6.27  3.97
 3  1    Fair  G     VS2      44      53  4032  6.31  6.24  4.12
 4  1.43 Fair  I     VS1      50.8    60  6727  7.73  7.25  3.93
 5  0.3  Fair  E     VVS2     51      67   945  4.67  4.62  2.37
 6  0.7  Fair  D     SI1      52.2    65  1895  6.04  5.99  3.14
 7  0.37 Fair  F     IF       52.3    61  1166  4.96  4.91  2.58
 8  0.56 Fair  H     VS2      52.7    70  1293  5.71  5.57  2.97
 9  1.02 Fair  I     SI1      53      63  2856  6.84  6.77  3.66
10  0.96 Fair  E     SI2      53.1    63  2815  6.73  6.65  3.55
# … with 53,930 more rows
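
Ascending and descending keys can also be mixed in one call. The following is a minimal sketch on a made-up tibble (`toy` and its values are assumptions for illustration): rows are grouped by cut in ascending order, and within each cut the most expensive row comes first.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  cut   = c("Good", "Fair", "Good", "Fair"),
  price = c(327, 337, 335, 2579)
)

# First key: cut (ascending); second key: price (descending) to break ties
sorted <- toy %>%
  arrange(cut, desc(price))

sorted
```

Each argument is treated independently, so desc() only affects the column it wraps.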

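2. Removing duplicates: distinct

distinct removes duplicates from the specified columns, keeping one row per unique value.

2.1. Removing duplicates in a single column

# Remove duplicates in the cut column
diamonds %>%
  distinct(cut)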

# A tibble: 5 x 1
  cut      
  <ord>    
1 Ideal    
2 Premium  
3 Good     
4 Very Good
5 Fair

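2.2. Removing duplicates across multiple columns

You can also specify multiple columns, in which case each unique combination of values is kept once.

# Remove duplicates in the cut and color columns
diamonds %>%
  distinct(cut, color)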

# A tibble: 35 x 2
   cut       color
   <ord>     <ord>
 1 Ideal     E    
 2 Premium   E    
 3 Good      E    
 4 Premium   I    
 5 Good      J    
 6 Very Good J    
 7 Very Good I    
 8 Very Good H    
 9 Fair      E    
10 Ideal     J    
# … with 25 more rows

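2.3. Removing duplicates while keeping the other columns

The .keep_all argument determines whether columns other than the specified ones are kept.
Setting it to TRUE keeps the other columns as well.
The default is FALSE.

# Remove duplicates in the cut column, keeping the other columns
diamonds %>%
  distinct(cut, .keep_all = TRUE)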

# A tibble: 5 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
5  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49

With .keep_all = TRUE, the values in the other columns are taken from the first row in which each value of the specified column appears.
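
Because the first matching row wins, the row order before distinct decides which row survives. As a hedged sketch on a made-up tibble (`toy` and its values are assumptions for illustration), sorting with arrange first lets you choose, for example, the cheapest row for each cut.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  cut   = c("Ideal", "Fair", "Ideal", "Fair"),
  price = c(500, 337, 326, 2579)
)

# Without sorting: the first row seen for each cut is kept
first_seen <- toy %>%
  distinct(cut, .keep_all = TRUE)

# Sort by price first: the cheapest row for each cut is kept
cheapest <- toy %>%
  arrange(price) %>%
  distinct(cut, .keep_all = TRUE)

first_seen
cheapest
```

Here first_seen keeps the Ideal row priced 500, while cheapest keeps the Ideal row priced 326.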

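3. Column-wise operations: across

across takes a set of columns and an operation, and applies that operation to each of those columns.
Think of it as the column-wise counterpart of rowwise.

Let's look at some concrete examples.

# Mean of depth, table, and price for each color
diamonds %>%
  group_by(color) %>%
  summarise(across(depth:price, mean))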

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 7 x 4
  color depth table price
  <ord> <dbl> <dbl> <dbl>
1 D      61.7  57.4 3170.
2 E      61.7  57.5 3077.
3 F      61.7  57.4 3725.
4 G      61.8  57.3 3999.
5 H      61.8  57.5 4487.
6 I      61.8  57.6 5092.
7 J      61.9  57.8 5324.
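
To make it concrete what across saves you from typing, here is a small sketch comparing it with the same summary written column by column. The tibble `toy` and its values are made up for illustration and only mimic the column layout of diamonds.

```r
library(dplyr)

# Hypothetical toy data with the same column order as the example above
toy <- tibble(
  color = c("D", "D", "E", "E"),
  depth = c(61, 63, 60, 62),
  table = c(55, 57, 58, 60),
  price = c(300, 500, 400, 800)
)

# One across() call covering depth, table, and price
with_across <- toy %>%
  group_by(color) %>%
  summarise(across(depth:price, mean))

# The same summary written out by hand, one column at a time
by_hand <- toy %>%
  group_by(color) %>%
  summarise(
    depth = mean(depth),
    table = mean(table),
    price = mean(price)
  )

with_across
by_hand
```

Both pipelines give the same per-color means; across simply removes the repetition as the number of columns grows.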

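# Columns: all numeric columns
# Operation: divide the mean of each column by its standard deviation
# Write the operation after ~; .x stands for the column being processed.
diamonds %>%
  group_by(color) %>%
  summarise(
    across(where(is.numeric), ~ mean(.x) / sd(.x))
  )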

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 7 x 8
  color carat depth table price     x     y     z
  <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 D      1.83  43.7  26.0 0.944  5.77  5.79  5.78
2 E      1.78  42.7  25.6 0.920  5.63  5.46  5.07
3 F      1.85  42.9  25.4 0.984  5.57  5.62  5.54
4 G      1.75  45.1  26.7 0.987  5.24  5.28  5.20
5 H      1.75  42.8  25.7 1.06   5.00  4.54  4.97
6 I      1.77  42.4  25.0 1.08   4.96  5.00  4.99
7 J      1.95  39.9  25.0 1.20   5.42  5.45  5.44
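
across can also apply several operations at once when given a named list of functions. The sketch below is an illustration on a made-up tibble (`toy` and its values are assumptions); it uses the `.names` argument of across to control how the output columns are named.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  color = c("D", "D", "E", "E"),
  depth = c(61, 63, 60, 62),
  price = c(300, 500, 400, 800)
)

# Apply both mean and sd to each selected column.
# "{.col}_{.fn}" names each result as <column>_<function name>.
stats <- toy %>%
  group_by(color) %>%
  summarise(
    across(c(depth, price),
           list(mean = mean, sd = sd),
           .names = "{.col}_{.fn}")
  )

names(stats)
```

The result has one column per column-function pair, ordered column by column (depth_mean, depth_sd, price_mean, price_sd).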

# Columns: all numeric columns
# Operation: subtract the mean and divide by the standard deviation (standardization)
diamonds %>%
  mutate(
    across(where(is.numeric),
           ~ (.x - mean(.x)) / sd(.x)
    )
  )

# A tibble: 53,940 x 10
   carat cut       color clarity  depth  table  price     x     y     z
   <dbl> <ord>     <ord> <ord>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
 1 -1.20 Ideal     E     SI2     -0.174 -1.10  -0.904 -1.59 -1.54 -1.57
 2 -1.24 Premium   E     SI1     -1.36   1.59  -0.904 -1.64 -1.66 -1.74
 3 -1.20 Good      E     VS1     -3.38   3.38  -0.904 -1.50 -1.46 -1.74
 4 -1.07 Premium   I     VS2      0.454  0.243 -0.902 -1.36 -1.32 -1.29
 5 -1.03 Good      J     SI2      1.08   0.243 -0.902 -1.24 -1.21 -1.12
 6 -1.18 Very Good J     VVS2     0.733 -0.205 -0.902 -1.60 -1.55 -1.50
 7 -1.18 Very Good I     VVS1     0.384 -0.205 -0.902 -1.59 -1.54 -1.51
 8 -1.13 Very Good H     SI1      0.105 -1.10  -0.901 -1.48 -1.42 -1.43
 9 -1.22 Fair      E     VS2      2.34   1.59  -0.901 -1.66 -1.71 -1.49
10 -1.20 Very Good H     VS1     -1.64   1.59  -0.901 -1.54 -1.47 -1.63
# … with 53,930 more rows

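It's a little tricky, but very handy once you get used to it.

across was added only recently and has a short history, so its syntax may change in the future.
The core idea, however, of specifying columns and then processing each of them, will not change, so keep that mental picture in mind.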
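
As one illustration of how the surface syntax can vary while the idea stays the same, the operation can be written either with the ~ shorthand used in this article or as an ordinary anonymous function. This is a sketch on a made-up tibble (`toy` and its values are assumptions for illustration), not a statement about which style future dplyr versions will prefer.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  carat = c(0.2, 0.3, 0.4),
  cut   = c("Ideal", "Good", "Fair")
)

# Formula shorthand: .x stands for the current column
formula_style <- toy %>%
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))

# Ordinary anonymous function: x stands for the current column
function_style <- toy %>%
  mutate(across(where(is.numeric), function(x) (x - mean(x)) / sd(x)))

formula_style
function_style
```

Both versions standardize the numeric column in the same way and leave the non-numeric cut column untouched.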

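Summary

This time we covered three handy functions.

Sorting: arrange
Removing duplicates: distinct
Column-wise operations: across

Once you can use these fluently, preprocessing gets even easier.

That's all for now. Good work!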
