dplyr – group last n row values

Question:

I have a dataframe as below

+--------+-----------+-----+
|  make  |   model   | cnt |
+--------+-----------+-----+
| toyota |  camry    |  10 |
| toyota |  corolla  |   4 |
| honda  |  city     |   8 |
| honda  |  accord   |  13 |
| jeep   |  compass  |   3 |
| jeep   |  wrangler |   5 |
| jeep   |  renegade |   1 |
| accura |  x1       |   2 |
| accura |  x3       |   1 |
+--------+-----------+-----+

I need to aggregate this dataframe by Make to get the total volume and the percentage share of each make. I do this as follows.

library(dplyr)

df <- data.frame(Make=c('toyota','toyota','honda','honda','jeep','jeep','jeep','accura','accura'),
                 Model=c('camry','corolla','city','accord','compass', 'wrangler','renegade','x1', 'x3'),
                 Cnt=c(10, 4, 8, 13, 3, 5, 1, 2, 1))
# total volume per make, plus each make's percentage share of the overall volume
dfc <- df %>%
  group_by(Make) %>%
  summarise(volume = sum(Cnt)) %>%
  mutate(share=volume/sum(volume)*100.0) %>%
  arrange(desc(volume))

The above operation gives me the share and volume aggregated by Make as below.

+--------+--------+-----------+
| make   | volume | share     |
+--------+--------+-----------+
| honda  | 21     | 44.680851 |
| toyota | 14     | 29.787234 |
| jeep   | 9      | 19.148936 |
| accura | 3      | 6.382979  |
+--------+--------+-----------+

I need to combine everything except the first two rows into a single group named others, summing its volume and share, so that the dataframe looks like this.

+--------+--------+-----------+
| make   | volume | share     |
+--------+--------+-----------+
| honda  | 21     | 44.680851 |
| toyota | 14     | 29.787234 |
| others | 12     | 25.53191  |
+--------+--------+-----------+

Question comments:

    
There are many ways to do that, and it really depends on the logic of your second grouping. Will you ALWAYS want to group everything apart from the first two rows? Is there name matching? A volume threshold, maybe?
Not always; two is just an example. Let's say group everything except the first n rows, where the rows are sorted by volume in descending order.
– user3206440, 3 hours ago
    
Got it. I was wondering whether a volume or share threshold would be a better way to group your data, so that you don't have to specify the number of rows manually. E.g., group together everything with a share below 25%. That way it will ALWAYS be a 25% threshold, though it might apply to 2, 3, 4, … rows.
    
If they'll be in sorted order when you do it, could you create a temp variable to aggregate on? df$temp <- c(1, 2, ..., n, rep(n+1, nrow(df)-n)), where n is the number of leading rows that keep their own group. Then just aggregate on that variable. It could also be recoded appropriately; I just used numbers for simplicity here.
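The temp-variable idea from the last comment could be sketched roughly as follows. This is only an illustration: `keep_n` and the intermediate column `temp` are names assumed here, and the input is the already aggregated, sorted table from the question.

```r
library(dplyr)

# Aggregated table from the question, already sorted by descending volume
dfc <- data.frame(Make = c("honda", "toyota", "jeep", "accura"),
                  volume = c(21, 14, 9, 3),
                  stringsAsFactors = FALSE)
dfc$share <- dfc$volume / sum(dfc$volume) * 100

keep_n <- 2  # assumed: number of leading rows that keep their own group

res <- dfc %>%
  # rows 1..keep_n get ids 1..keep_n; every later row shares id keep_n + 1
  mutate(temp = pmin(row_number(), keep_n + 1),
         Make = ifelse(temp > keep_n, "others", Make)) %>%
  group_by(temp, Make) %>%
  summarise(volume = sum(volume), share = sum(share), .groups = "drop") %>%
  select(-temp)

print(res)
```

Because the result is ordered by `temp` rather than by name, the rows should come out as honda, toyota, others, which matches the layout requested in the question.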

Answers:

Answer 1:

library(dplyr)

# example data
df <- data.frame(Make=c('toyota','toyota','honda','honda','jeep','jeep','jeep','accura','accura'),
                 Model=c('camry','corolla','city','accord','compass', 'wrangler','renegade','x1', 'x3'),
                 Cnt=c(10, 4, 8, 13, 3, 5, 1, 2, 1), stringsAsFactors = FALSE)

# specify the number of top rows to keep as separate groups
row_threshold <- 2

df %>%
  group_by(Make) %>%
  summarise(volume = sum(Cnt)) %>%
  mutate(share=volume/sum(volume)*100.0) %>%
  arrange(desc(volume)) %>%
  group_by(Make_upd = ifelse(row_number() > row_threshold, "others", Make)) %>%
  summarise(volume = sum(volume),
            share = sum(share))

# # A tibble: 3 x 3
#   Make_upd volume    share
#      <chr>  <dbl>    <dbl>
# 1    honda     21 44.68085
# 2   others     12 25.53191
# 3   toyota     14 29.78723
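Note that the output above is ordered alphabetically by `Make_upd`, so "others" lands in the middle rather than at the bottom as in the question's desired table. One possible way to restore that order, sketched here as an optional extra step rather than as part of the original answer, is to sort on a logical key first:

```r
library(dplyr)

df <- data.frame(Make = c("toyota", "toyota", "honda", "honda", "jeep",
                          "jeep", "jeep", "accura", "accura"),
                 Model = c("camry", "corolla", "city", "accord", "compass",
                           "wrangler", "renegade", "x1", "x3"),
                 Cnt = c(10, 4, 8, 13, 3, 5, 1, 2, 1),
                 stringsAsFactors = FALSE)

row_threshold <- 2

out <- df %>%
  group_by(Make) %>%
  summarise(volume = sum(Cnt)) %>%
  mutate(share = volume / sum(volume) * 100) %>%
  arrange(desc(volume)) %>%
  group_by(Make_upd = ifelse(row_number() > row_threshold, "others", Make)) %>%
  summarise(volume = sum(volume), share = sum(share)) %>%
  # FALSE sorts before TRUE, so named makes come first and "others" goes last
  arrange(Make_upd == "others", desc(volume))

print(out)
```

With the example data this should list honda, then toyota, then others.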

Answer comments:

Nice approach.
    
I see they order by volume descending, so it's very likely they pick the number of rows to group based on a volume value. I still think it would be better to use a volume threshold.
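The threshold alternative raised in the comments might look something like the sketch below. It uses the 25% share cut-off mentioned in the question comments; `share_threshold` is a name assumed for illustration, and a volume cut-off would work the same way by testing `volume` instead of `share`.

```r
library(dplyr)

df <- data.frame(Make = c("toyota", "toyota", "honda", "honda", "jeep",
                          "jeep", "jeep", "accura", "accura"),
                 Model = c("camry", "corolla", "city", "accord", "compass",
                           "wrangler", "renegade", "x1", "x3"),
                 Cnt = c(10, 4, 8, 13, 3, 5, 1, 2, 1),
                 stringsAsFactors = FALSE)

share_threshold <- 25  # percent; assumed cut-off taken from the comment's example

by_share <- df %>%
  group_by(Make) %>%
  summarise(volume = sum(Cnt)) %>%
  mutate(share = volume / sum(volume) * 100,
         # relabel every make whose share falls below the cut-off
         Make = ifelse(share < share_threshold, "others", Make)) %>%
  group_by(Make) %>%
  summarise(volume = sum(volume), share = sum(share)) %>%
  arrange(Make == "others", desc(volume))

print(by_share)
```

For this data the 25% cut-off happens to give the same three rows as keeping the top two makes, but the number of rows kept would change with the data rather than being fixed in advance.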

Original source:

https://stackoverflow.com/questions/47753354/dplyr-group-last-n-row-values
