
An environment scan of frequency table functions

Many years ago I created an R package to construct simple frequency tables. For all of R’s power, I’ve always found this most basic of summary functions to be lacking, so I created one that I liked. I got a minimally viable package working and started using it myself, but I never put in the effort to get it listed on CRAN.

Fast forward five years and I still find myself using my package all the time, but always with a need for caution, as it’s not a package others on my team use or can “easily” install¹. I think it might be time to change that.

When starting virtually any analysis, my first step is always to get the “lay of the land,” and that almost always means examining the frequencies of my variables. Years ago I was mainly a SAS user—and, way before that, an SPSS user—and both offer simple, attractive, full-featured functions to create frequency tables. Not so with R. The tools are there, of course, but it takes a fair bit of work. It’s kinda the difference between being given a house and being given a pile of wood, a hammer, and some nails.

And when I want a frequency table, I don’t want to do a lot of work to get it. I’m looking for something that’s easy, has sensible defaults—but the ability to customize when needed—and provides easy-to-read (and hopefully attractive) output both at the console and in knitted documents. I didn’t feel like any package really offered this. Hence why I made my package in the first place.

So maybe it’s time to get my package cleaned up and onto CRAN. But, before I go to the trouble, it seems prudent to see what the current state of frequency table functions is in the wider ecosystem.  Am I trying to solve a problem that other packages have already solved (and perhaps better)?  It feels like information I definitely should know before devoting a lot of time to this project.

So I undertook to identify packages with similar functions and to evaluate them on the features they have (and don’t have).  What I found is that there are tons of ways to produce frequency tables.  Here are eleven with my views on each.

base::table

The most obvious starting point for producing a frequency table is to simply use the base::table function. This is a no-frills approach that, “out of the box”, doesn’t offer a ton. You feed it a variable; it provides a simple (horizontal) table of frequencies.

That’s it.

No proportions, no cumulative results.  By default, NAs are excluded.  Of course, there is plenty you can do from there to get proportions, etc., if that’s what you want.  But you’ll have to do some work to get this table to show you anything other than simple counts.

> base::table(iris$Species, useNA="always")
setosa versicolor  virginica       <NA> 
    47         49         50          4 
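
For the record, here’s a minimal sketch of that extra work, using only base functions, to coax percents and cumulative counts out of base::table:

tab <- table(iris$Species, useNA = "always")

round(100 * prop.table(tab), 1)  # percents
cumsum(tab)                      # cumulative counts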
Attribute                     Rating
----------------------------  --------
Input accepted                Vector
Number of Dependencies        0
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              No
Prints Metadata               No
Sorts Results by Frequency    No
Allows Optional Weighting     No
Produces Accompanying Graph   No
Number of Decimals Printed    N/A

dplyr::count

Likewise for the next obvious candidate, dplyr::count. I’m pretty fully enmeshed in the ‘Tidyverse’ and use its approach every time I touch R, so dplyr::count is a natural option for a lot of my work. But the reality is that, just like base::table, it takes a lot of work to get the basic information I want from a frequency table out of dplyr::count. By default you just get (are you ready for this): counts. If you need proportions or cumulative proportions, then you’ll be mutating the data yourself from there, as in the sketch below. Super powerful, super easy to program, super not what I’m looking for.

> iris %>%
+   count(Species)
     Species  n
1     setosa 47
2 versicolor 49
3  virginica 50
4       <NA>  4
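
A minimal sketch of that do-it-yourself mutating, assuming you want proportions and cumulative proportions alongside the counts:

library(dplyr)

iris %>%
  count(Species) %>%
  mutate(
    prop     = n / sum(n),   # proportions
    cum_n    = cumsum(n),    # cumulative counts
    cum_prop = cumsum(prop)  # cumulative proportions
  )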
Attribute                     Rating
----------------------------  --------
Input accepted                Tidy Var
Number of Dependencies        13
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              No
Prints Metadata               No
Sorts Results by Frequency    Yes
Allows Optional Weighting     Yes
Produces Accompanying Graph   No
Number of Decimals Printed    N/A

descr::freq

Next I turned my attention to a few packages that I’d used previously for frequency tables to remind myself what they had to offer. First up was descr::freq. To the best of my recollection, this may have been the package I used most often for frequency tables prior to producing my own attempt. There’s a lot here I like, and I’m sure I drew inspiration from it. It gives counts, but also percents. By default, it shows missing values and gives valid percents as well. It also, by default, produces an accompanying bar graph of the counts. This is very much along the lines of what I’m looking for.

> descr::freq(iris$Species)
iris$Species 
           Frequency Percent Valid Percent
setosa            47  31.333         32.19
versicolor        49  32.667         33.56
virginica         50  33.333         34.25
NA's               4   2.667              
Total            150 100.000        100.00
Attribute                     Rating
----------------------------  --------
Input accepted                Vector
Number of Dependencies        5
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              Yes
Prints Metadata               No
Sorts Results by Frequency    No
Allows Optional Weighting     Yes
Produces Accompanying Graph   Yes
Number of Decimals Printed    3/2

janitor::tabyl

Another option I’ve used in the past comes from the janitor package: janitor::tabyl. It would be doing a disservice to this function to just label it a frequency table function. It’s far more than that: it’s meant as a replacement for base::table and comes with a myriad of ways to format the output to your liking (a sketch of a few is below). Unfortunately, out of the box, the output isn’t what I’m looking for. You get frequencies, percents, and valid percents, but not formatted in a particularly appealing way.

> janitor::tabyl(iris, Species)
    Species  n    percent valid_percent
     setosa 47 0.31333333     0.3219178
 versicolor 49 0.32666667     0.3356164
  virginica 50 0.33333333     0.3424658
       <NA>  4 0.02666667            NA
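
To be fair to janitor, here’s a short sketch of that formatting machinery, using the adorn_* helpers to add a totals row and readable percents:

library(janitor)

iris %>%
  tabyl(Species) %>%
  adorn_totals("row") %>%           # append a Total row
  adorn_pct_formatting(digits = 1)  # render percents like "31.3%"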
Attribute                     Rating
----------------------------  ---------------------
Input accepted                Data Frame or Vector
Number of Dependencies        12
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              No
Prints Metadata               No
Sorts Results by Frequency    No
Allows Optional Weighting     No
Produces Accompanying Graph   No
Number of Decimals Printed    7

questionr::freq

A third option I’ve used in the past is questionr::freq. While this package is designed to simplify survey analysis, it has a very good frequency function. By default it gives a lot of information in a fairly condensed format: n, percent, valid percent, cumulative percent, and cumulative valid percent. It’s perfectly serviceable, though, to my eyes at least, the output is so compressed that it’s actually a little hard to read.

> questionr::freq(iris$Species, cum=T)
            n    % val%  %cum val%cum
setosa     47 31.3 32.2  31.3    32.2
versicolor 49 32.7 33.6  64.0    65.8
virginica  50 33.3 34.2  97.3   100.0
NA          4  2.7   NA 100.0      NA
Attribute                     Rating
----------------------------  --------
Input accepted                Vector
Number of Dependencies        12
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              Yes
Prints Metadata               No
Sorts Results by Frequency    Yes
Allows Optional Weighting     No
Produces Accompanying Graph   No
Number of Decimals Printed    1

freqtables::freq_table

During my environment scan I came across two packages with promising names. The first was freqtables, with its function freqtables::freq_table. This function also gives a lot of good information by default, but unfortunately gives a lot of information that I’m not looking for as well (like a standard error and upper and lower confidence limits for each category).

> freqtables::freq_table(iris, Species)
      var        cat  n n_total   percent       se   t_crit        lcl       ucl
1 Species     setosa 47     150 31.333333 3.799996 1.976013 24.3501516 39.279338
2 Species versicolor 49     150 32.666667 3.842150 1.976013 25.5694691 40.657721
3 Species  virginica 50     150 33.333333 3.861896 1.976013 26.1816359 41.344424
4 Species       <NA>  4     150  2.666667 1.319842 1.976013  0.9930968  6.962216
Attribute                     Rating
----------------------------  --------
Input accepted                Tidy Var
Number of Dependencies        3
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              No
Prints Metadata               No
Sorts Results by Frequency    No
Allows Optional Weighting     No
Produces Accompanying Graph   No
Number of Decimals Printed    6

frequency::freq

The second promisingly named package was frequency, with its function frequency::freq. This package seeks to produce SPSS/SAS-like frequency tables. And it does. The output is attractive and information-rich. But, unfortunately, by default it is directed to an HTML file. It is possible to get console output by setting an option, but that’s not what I’m looking for.

> x <- frequency::freq(iris$Species)
Building tables
  |===================================================================================================| 100%
Temporary file saved to: /var/folders/yq/3n8lg2pn0lb81c__1dy3g52r0000gp/T//RtmpjnQ6Hr/file74f95a077397.html
To open by default use: options(frequency_open_output = TRUE)
> print(x)
$`x:`
                  x label Freq Percent Valid Percent Cumulative Percent
   Valid     setosa         47    31.3          32.2               32.2
         versicolor         49    32.7          33.6               65.8
          virginica         50    33.3          34.2              100.0
              Total        146    97.3         100.0                   
 Missing    <blank>          0     0.0                                 
               <NA>          4     2.7                                 
              Total        150   100.0  
Attribute                     Rating
----------------------------  --------
Input accepted                Vector
Number of Dependencies        6
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              Yes
Prints Metadata               No
Sorts Results by Frequency    No
Allows Optional Weighting     Yes
Produces Accompanying Graph   Yes
Number of Decimals Printed    1

cleaner::freq

There were other options I came across during my environment scan that had less obvious package names but seemed worth examining. One of these was cleaner::freq. cleaner::freq produces some very attractive output with percents, cumulative counts, and cumulative percents. It doesn’t show missing values by default, but it does offer this as an option, along with a lot of other options as well. One thing that I especially like about this option is that it formats the output differently in the console and in an R Markdown/Quarto document. By setting output=“axis”, the code will render as a pretty nice table.

> cleaner::freq(iris$Species)
Frequency table 

Class:      factor (numeric)
Length:     150
Levels:     3: setosa, versicolor, virginica
Available:  146 (97.33%, NA: 4 = 2.67%)
Unique:     3

     Item            Count    Percent    Cum. Count    Cum. Percent
---  ------------  -------  ---------  ------------  --------------
1    virginica          50     34.25%            50          34.25%
2    versicolor         49     33.56%            99          67.81%
3    setosa             47     32.19%           146         100.00%
Attribute                     Rating
----------------------------  --------
Input accepted                Vector
Number of Dependencies        6
Pretty Console Output         Yes
Pretty Knitted Output         Yes
Prints Total row              No
Prints Metadata               Yes
Sorts Results by Frequency    Yes
Allows Optional Weighting     Yes
Produces Accompanying Graph   No
Number of Decimals Printed    2

summarytools::freq

summarytools was another package I wasn’t familiar with but that had a promising function, summarytools::freq. It also produces some very attractive and informative output. By default it shows NAs, and it has lots of different options. There’s a markdown option too but, unlike with cleaner::freq, you have to set it yourself (a sketch is below).

> summarytools::freq(iris, Species)
Frequencies  
iris$Species  
Type: Factor  

                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
---------------- ------ --------- -------------- --------- --------------
          setosa     47     32.19          32.19     31.33          31.33
      versicolor     49     33.56          65.75     32.67          64.00
       virginica     50     34.25         100.00     33.33          97.33
            <NA>      4                               2.67         100.00
           Total    150    100.00         100.00    100.00         100.00
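
For reference, a sketch of setting that markdown option yourself; inside an R Markdown or Quarto chunk you’d also set results="asis":

library(summarytools)

# style = "rmarkdown" swaps the console layout for pipe-table markdown
freq(iris$Species, style = "rmarkdown", plain.ascii = FALSE)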
Attribute                     Rating
----------------------------  -------------------
Input accepted                Vector or Tidy Var
Number of Dependencies        17
Pretty Console Output         Yes
Pretty Knitted Output         Yes
Prints Total row              Yes
Prints Metadata               Yes
Sorts Results by Frequency    Yes
Allows Optional Weighting     Yes
Produces Accompanying Graph   No
Number of Decimals Printed    N/A

epiDisplay::tab1

I’m not sure how I came across epiDisplay::tab1, but it produces a competent frequency table. By default it shows a bar chart. The console output is a bit minimal (showing only counts, percents, and valid percents) and there is no rmarkdown option. But it has a lot of options for customization.

> epiDisplay::tab1(iris$Species, sort.group = T)
iris$Species : 
           Frequency   %(NA+)   %(NA-)
setosa            47     31.3     32.2
versicolor        49     32.7     33.6
virginica         50     33.3     34.2
NA's               4      2.7      0.0
  Total          150    100.0    100.0
Attribute                     Rating
----------------------------  --------
Input accepted                Vector
Number of Dependencies        4
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              Yes
Prints Metadata               No
Sorts Results by Frequency    Yes
Allows Optional Weighting     No
Produces Accompanying Graph   Yes
Number of Decimals Printed    1

datawizard::data_tabulate

The last package I came across was datawizard, with its function datawizard::data_tabulate. This function produces decent-looking console output that renders as valid markdown as well. By default, NAs are printed, and it includes percents, valid percents, and cumulative percents. Options for customizing this one are minimal, though.

> datawizard::data_tabulate(iris$Species)
iris$Species <categorical>
# total N=150 valid N=146

Value      |  N | Raw % | Valid % | Cumulative %
-----------+----+-------+---------+-------------
setosa     | 47 | 31.33 |   32.19 |        32.19
versicolor | 49 | 32.67 |   33.56 |        65.75
virginica  | 50 | 33.33 |   34.25 |       100.00
<NA>       |  4 |  2.67 |    <NA> |         <NA>
Attribute                     Rating
----------------------------  --------
Input accepted                Vector
Number of Dependencies        3
Pretty Console Output         Yes
Pretty Knitted Output         Yes
Prints Total row              No
Prints Metadata               Yes
Sorts Results by Frequency    No
Allows Optional Weighting     No
Produces Accompanying Graph   No
Number of Decimals Printed    2

freqR::freq

Finally, it seems only fair that I put my (non-CRAN) package up to the same scrutiny that I put the others up to. I must say, I like my console output. I appreciate the separating top and bottom dividers and the divider between the header and the table. freqR::freq gives the counts, percents, cumulative frequencies, and cumulative percents. By default I include NAs, but I don’t give valid percents, which may be a mistake. It also doesn’t render particularly well (or at all) in markdown.

> freqR::freq(iris, Species)

FREQUENCY: Species
============================================================
        Species     Freq        %     Cum. Freq     Cum. %
------------------------------------------------------------
      virginica       50     33.3            50       33.3
     versicolor       49     32.7            99       66.0
         setosa       47     31.3           146       97.3
             NA        4      2.7           150      100.0
============================================================
Attribute                     Rating
----------------------------  -------------------
Input accepted                Vector or Tidy Var
Number of Dependencies        3
Pretty Console Output         Yes
Pretty Knitted Output         No
Prints Total row              No
Prints Metadata               No
Sorts Results by Frequency    Yes
Allows Optional Weighting     No
Produces Accompanying Graph   Yes
Number of Decimals Printed    1

Summary

Here, then, is my summary. If I had to go with one of these today–other than my own–it’d probably be either cleaner::freq or summarytools::freq. Both produce attractive output in markdown and on the console, and through this exercise it’s become clear to me that that is my number one requirement. But neither fits the bill in other ways. Neither produces a graph by default, and that’s something I really value. cleaner::freq doesn’t show missing values by default, while summarytools is ‘heavy’, with 17 dependencies.

For me, this means I do, in fact, want to proceed with revising freqR and submitting it to CRAN (almost certainly with a new name…I’m thinking “SimpleFreqs”). This exercise has definitely shown me some things I, personally, view as critical for a frequency table: I need the output to be pretty; I want it to produce an accompanying graph; and missing values should be shown by default, as should a totals row. Some of this my package currently does; some I’ll need to add. I’m excited to get started.

  1. Actually, installing from GitHub is super easy, but it’s a tiny barrier, and that little bit of extra friction during code review isn’t worth it for my little package. ↩︎
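
For the curious, that GitHub install is a one-liner with the remotes package (the repository path here is a hypothetical placeholder, since I haven’t published one):

remotes::install_github("<github-user>/freqR")  # hypothetical repo path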

Music Videos

I’ve started a new waste-of-time project. I’m posting one music video to mastodon (@mbarron@mas.to) each day. The only rule about what’s included is that I like the song. Though, in typical fashion, that means the videos are heavily pulled from about, oh, the six years immediately before, during, and after my college years. ¯\_(ツ)_/¯

(As I’ve said to many people: I’ve got all the world’s music at my fingertips and I’m still listening to the same 500 songs I heard in college).

You can follow along on mastodon or find the whole playlist on YouTube.

The Rise and Fall of generations

My previous post showing the definitions of different generations was in service of creating the chart below. This chart illustrates the “rise and fall” of generations across their lifecycle. As a new generation is born, its share of the population increases. Once a generation’s births have ended, there is a very long tail as people in that generation slowly die and new generations are born.

The data for the chart below come from the U.S. Census Bureau. I was able to use yearly population estimates from 1980 onward, but prior to 1980 the data come from decennial Census PUMS data (hence why, for example, Gen X looks like it starts in 1970–instead of 1965–and why the baby boomers have such an odd slope between 1960 and 1970). The Census Bureau does not publish birth year, so I estimated birth year (and thus generation) from age and year of estimate. There will thus be some ‘slop’ in my estimates, but they should be close enough for my purpose here.
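A minimal sketch of that estimate, assuming a data frame pop with (hypothetical) year and age columns, and using the generation cutoffs from the Pew-style diagram in the next post:

# Birth year is approximate: anyone whose birthday hasn't yet occurred
# in the estimate year gets assigned a birth year one too late.
pop$birth_year <- pop$year - pop$age

pop$generation <- cut(
  pop$birth_year,
  breaks = c(1883, 1901, 1928, 1946, 1965, 1981, 1997, 2013, 2031),
  labels = c("Lost", "Greatest", "Silent", "Boomer", "Gen X",
             "Millennial", "Gen Z", "Alpha"),
  right = FALSE  # each interval includes its starting birth year
)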

[Chart showing the generational distribution of the U.S. population from 1940 to 2021]

Defining the generations

Generations is a very… imprecise… sociological concept. People just sorta look at a rough cohort of ages and say “yeah, you all are a generation.” Start and end points (as well as labels) just sorta coalesce out of the ether.

Nevertheless, they have their use and certainly are cemented in the popular imagination. I’ve been doing some reading on generations and really appreciated this diagram and article by Pew Research. But I wish it went a little earlier and a little later. So I made my own version, *heavily* influenced by the Pew version.

[Diagram showing generation birth years, starting with the Lost Generation (born 1883–1900) and ending with Generation Alpha (born 2013–2030)]