Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers which represent the number of times I saw "First".

For example:

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum the Frequencies:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?

9 Answers 11

If x is a dataframe with your data, then the following will do what you want:

recast(x, Category ~ ., fun.aggregate=sum)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))

Just to add a third option:

summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)

EDIT: this is a very old answer. Now I would recommend the use of group_by and summarise from dplyr, as in @docendo answer.

up vote 219 down vote accepted

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

(embedding @thelatemail comment), aggregate has a formula interface too

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
1 upvote
this answer is the only one which doesn't make use of any external library; however, I prefer to use doBy at least, which allows to group by more than one function, and has a fancier syntax. – dalloliogm
12 upvote
And for more current searchers, the formula interface to aggregate is nice: aggregate(Frequency ~ Category,data=x,sum) – thelatemail
How is the formula interface intuitive? How is Frequency ~ Category a statistical formula? – Andrew McKinlay
2 upvote
@AndrewMcKinlay, R uses the tilde to define symbolic formulae, for statistics and other functions. It can be interpreted as "model Frequency by Category" or "Frequency depending on Category". Not all languages use a special operator to define a symbolic function, as done in R here. Perhaps with that "natural-language interpretation" of the tilde operator, it becomes more meaningful (and even intuitive). I personally find this symbolic formula representation better than some of the more verbose alternatives. – r2evans

This is somewhat related to this question.

You can also just use the by() function:

x2 <- by(x$Frequency, x$Category, sum)

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:

data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let's compare that to the same thing using data.frame and the above above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the column this is the syntax:

#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply and .SD as follows

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
7 upvote
+1 But 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, and with more than 3 groups, for data.table to shine. We'll try and support more than 2 billion rows soon for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31. – Matt Dowle
1 upvote
True. Turns out I don't have all that RAM though, and was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data. – asieira
I had 7 mil observations dplyr took .3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic and you beat me to it! – zazu
1 upvote
There is a even shorter way to write this data[, sum(Frequency), by = Category]. You could use .N which substitutes the sum() function. data[, .N, by = Category]. Here is a useful cheatsheet: s3.amazonaws.com/assets.datacamp.com/img/blog/… – Stophface
Using .N would be equivalent to sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). And that is not the case here. – asieira

More recently, you can also use the dplyr package for that purpose:

x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>% 
  group_by(Category) %>% 

Update for dplyr >= 0.5: summarise_each has been replaced by summarise_all, summarise_at and summarise_if family of functions in dplyr.

Or, if you have multiple columns to group by, you can specify all of them in the group_by separated with commas:

mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

For more information, including the %>% operator, see the introduction to dplyr.

How fast is it when compared to the data.table and aggregate alternatives presented in other answers? – asieira
2 upvote
@asieira, Which is fastest and how big the difference (or if the difference is noticeable) is will always depend on your data size. Typically, for large data sets, for example some GB, data.table will most likely be fastest. On smaller data size, data.table and dplyr are often close, also depending on the number of groups. Both data,table and dplyr will be quite a lot faster than base functions, however (can well be 100-1000 times faster for some operations). Also see here – docendo discimus

Several years later, just to add another simple base R solution that isn't present here for some reason- xtabs

xtabs(Frequency ~ Category, df)
# Category
# First Second  Third 
#    30      5     34 

Or if want a data.frame back

as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34

While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with sqldf

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")), 

          ,sum(Frequency) as Frequency 
       from x 
       group by 

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34

Not the answer you're looking for? Browse other questions tagged or ask your own question.