I came across an old blog post where the author (Jeff Shaffer) attempted to recreate the Pew Research graph (included below) using Tableau. He succeeded—to my eye at least—and made something that looks really attractive and really close to the original Pew graph. See the original blog post for a comparison between the original and his reconstruction.
Reading the post got me wondering if I could recreate the Pew graph myself using R and ggplot2. There is a ton of “non-standard” stuff going on in the original Pew graph (for starters, it’s not really one graph; it’s six) and I was curious how close I could get.
Turns out I was able to get pretty close, I think. Here’s my final version side-by-side with the original. There are a couple of details that I couldn’t solve (like the graphs being just a little too compressed). And the process of creating this was…fiddly, to say the least. I ended up with numerous ‘magic’ constants that I had to revise over and over until I got something that looked reasonable.1 And one bit—adding spaces to a label to push its alignment left—I’m downright ashamed of (but I couldn’t find another way to accomplish my goal). Still, I’m pretty happy with the final product.
Note, like the original post’s author, I’m not sure I’d argue this is the best way to display this data. The odd axis treatment on the right-hand bar charts seems likely to confuse. But, still, this is an attractive visualization, and I’ve always appreciated Pew’s “house” style.
If you’re interested in the code, I’ve posted it to GitHub.
There are a lot more hardcoded constants throughout my code, but seven parameters gave me enough trouble that I created named constants for them. ↩︎
I’ve been playing around with the aRtsy package. I’ve just been using the package’s predefined functions with (mostly) function defaults. I finished a first pass through all the functions today. Here are my favorites among the many trial pieces I created.
I’ve been following the R-bloggers site for years. It’s a feed aggregator for blog posts about R. I’ve published a few recent posts on R (and have a couple more queued up). So I figured it was time to submit this site. In preparation, I’ve got a new category (“R”) that you can subscribe to directly.
If you’re an R programmer and you’ve never checked out R-bloggers, it’s a great way to find interesting articles.
I’ve been working the last few days on polishing my frequency table generator. Formerly called FreqR, the package is now simplefreqs.
The repository can be found on GitHub, and I’ve got a simple documentation website running on GitHub Pages.
It’s just about ready to be submitted to CRAN. But before submitting it, I’d love to have some testers take it for a spin. If you’re an R user and you ever need to produce simple frequency tables, give it a whirl.
Many years ago I created an R package to construct simple frequency tables. For all of R’s power, I’ve always found this most basic of summary functions to be lacking. So I created one that I liked for myself. I got a minimally viable package working and started using it myself, but I never put in the effort to get it listed on CRAN.
Fast forward five years and I still find myself using my package all the time, but always with a need for caution, as it’s not a package others on my team use or can “easily” install.1 I think it might be time to change that.
When starting virtually any analysis, my first step is always to get the “lay of the land,” and that almost always means examining the frequencies of my variables. Years ago I was mainly a SAS user—and way before that an SPSS user—and both offer simple, attractive, full-featured functions to create frequency tables. Not so with R. The tools are there, of course, but it takes a fair bit of work. It’s kinda the difference between being given a house and being given a pile of wood, a hammer, and some nails.
And when I want a frequency table, I don’t want to do a lot of work to get it. I’m looking for something that’s easy, has sensible defaults—but the ability to customize when needed—and provides easy-to-read (and hopefully attractive) output both at the console and in knitted documents. I didn’t feel like any package really offered this, which is why I made my package in the first place.
So maybe it’s time to get my package cleaned up and onto CRAN. But, before I go to the trouble, it seems prudent to see what the current state of frequency table functions is in the wider ecosystem. Am I trying to solve a problem that other packages have already solved (and perhaps better)? It feels like information I definitely should know before devoting a lot of time to this project.
So I undertook to identify packages with similar functions and to evaluate them on the features they have (and don’t have). What I found is that there are tons of ways to produce frequency tables. Here are eleven with my views on each.
base::table
The most obvious starting point for producing a frequency table is to simply use the base::table function. This is a minimal-frills approach that, “out-of-the-box,” doesn’t offer a ton. You feed it a variable; it provides a simple (horizontal) table of frequencies.
That’s it.
No proportions, no cumulative results. By default, NAs are excluded. Of course, there is plenty you can do from there to get proportions, etc., if that’s what you want. But you’ll have to do some work to get this table to show you anything other than simple counts.
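For instance, here’s a minimal sketch (variable names are mine) of the extra assembly required to get proportions and cumulative proportions:

# counts, keeping NAs visible (if any are present)
counts <- table(iris$Species, useNA = "ifany")
counts

# proportions and cumulative proportions: you assemble the pieces yourself
props <- prop.table(counts)
round(cbind(n = counts, prop = props, cum_prop = cumsum(props)), 3)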
dplyr::count
Likewise for the next obvious candidate, dplyr::count. I’m pretty fully enmeshed in the ’Tidyverse’ and use its approach every time I touch R. So dplyr::count is a natural option for a lot of my work. But the reality is that, just like base::table, it takes a lot of work to get the basic information I want from a frequency table out of dplyr::count. By default you just get (are you ready for this): counts. If you need proportions or cumulative proportions, then you’ll be mutating the data yourself from there. Super powerful, super easy to program, super not what I’m looking for.
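To illustrate, this is roughly the boilerplate I end up retyping every time (a sketch, assuming dplyr is loaded):

library(dplyr)

iris %>%
  count(Species) %>%
  mutate(
    prop     = n / sum(n),      # proportions: roll your own
    cum_n    = cumsum(n),       # cumulative counts: ditto
    cum_prop = cumsum(prop)     # cumulative proportions: ditto
  )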
descr::freq
Next I turned my attention to a few packages that I’d used previously for frequency tables to remind myself what they had to offer. First up was descr::freq. To the best of my recollection, this may have been the package I used most often for frequency tables before writing my own. There’s a lot here I like, and I’m sure I drew inspiration from it. It gives counts, but also percents. By default, it shows missing values and gives valid percents as well. It also, by default, produces an accompanying bar graph of the counts. This is very much along the lines of what I’m looking for.
> descr::freq(iris$Species)
iris$Species
           Frequency  Percent  Valid Percent
setosa            47   31.333          32.19
versicolor        49   32.667          33.56
virginica         50   33.333          34.25
NA's               4    2.667
Total            150  100.000         100.00
Attribute                     Rating
----------------------------  ------
Input accepted                Vector
Number of Dependencies        5
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              Yes
Prints Metadata               No
Sorts Results by Frequency    No
Allows Optional Weighting     Yes
Produces Accompanying Graph   Yes
Number of Decimals Printed    3/2
janitor::tabyl
Another option I’ve used in the past comes from the janitor package: janitor::tabyl. It would be doing a disservice to this function to just label it a frequency table function. It’s far more than that. It’s meant as a replacement for base::table and comes with a myriad of ways to format the output to your liking. Unfortunately, simply out-of-the-box, the output isn’t what I’m looking for. You get frequencies, percents, and valid percents, but not formatted in a particularly appealing way. Not right for what I’m looking for.
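To make that concrete, here’s a quick sketch: the information is all reachable, but polish means chaining janitor’s adorn_* helpers yourself:

library(janitor)

tabyl(iris$Species)        # plain data frame: n, percent, valid_percent

# closer to a finished table, but only after extra adorn_* steps
iris %>%
  tabyl(Species) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting()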
questionr::freq
A third option I’ve used in the past is questionr::freq. While the questionr package is designed to simplify survey analysis, it has a very good frequency function. By default it gives a lot of information in a fairly condensed format: n, percent, valid percent, cumulative percent, and cumulative valid percent. It’s perfectly serviceable, though, to my eyes at least, the output is so compressed that it’s actually a little hard to read.
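The call itself is a one-liner; cum and total are the arguments I find myself reaching for (my recollection of the signature):

questionr::freq(iris$Species, cum = TRUE, total = TRUE)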
freqtables::freq_table
During my environment scan I came across two packages with promising names. The first was freqtables, with its function freqtables::freq_table. This function also gives a lot of good information by default, but unfortunately also gives a lot of information that I’m not looking for (like a standard error for each category and upper and lower confidence intervals).
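If memory serves, freq_table is data-frame-first rather than vector-first, so the call looks something like this:

library(dplyr)
library(freqtables)

iris %>%
  freq_table(Species)   # n, percent, plus standard errors and CI bounds by default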
frequency::freq
The second promisingly named package was frequency, with its function frequency::freq. This package seeks to produce SPSS/SAS-like frequency tables. And it does. The output is attractive and information-rich. But, unfortunately, by default it is directed to an HTML output. It is possible to get console output by setting an option, but that’s still not quite what I’m looking for.
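For reference, the basic call is below (as noted, by default the result renders as HTML rather than printing at the console):

frequency::freq(iris$Species)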
cleaner::freq
There were other options I came across during my environment scan that had less obvious package names but seemed worth examining. One of these was cleaner::freq. cleaner::freq produces some very attractive output with percents, cumulative counts, and cumulative percents. It doesn’t show missing values by default, but it does offer this as an option, along with a lot of other options as well. One thing that I especially like about this option is that it formats the output differently in the console and in an R Markdown/Quarto document. By setting output=“axis” the code will render as a pretty nice table.
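The default call is as simple as they come:

cleaner::freq(iris$Species)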
summarytools::freq
summarytools::freq was another function I wasn’t familiar with, but it looked promising. summarytools also produces some very attractive and informative output. By default it shows NAs, and it has lots of different options. There’s a markdown option too, but, unlike with cleaner::freq, you have to set it yourself.
> summarytools::freq(iris, Species)
Frequencies
iris$Species
Type: Factor

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------ ------ --------- -------------- --------- --------------
      setosa     47     32.19          32.19     31.33          31.33
  versicolor     49     33.56          65.75     32.67          64.00
   virginica     50     34.25         100.00     33.33          97.33
        <NA>      4                               2.67         100.00
       Total    150    100.00         100.00    100.00         100.00
Attribute                     Rating
----------------------------  ------------------
Input accepted                Vector or Tidy Var
Number of Dependencies        17
Pretty Console Output         Yes
Pretty Knitted Output         Yes
Prints Total row              Yes
Prints Metadata               Yes
Sorts Results by Frequency    Yes
Allows Optional Weighting     Yes
Produces Accompanying Graph   No
Number of Decimals Printed    N/A
epiDisplay::tab1
I’m not sure how I came across epiDisplay::tab1, but it produces a competent frequency table. By default it shows a bar chart. The console output is a bit minimal (showing only counts, percents, and valid percents) and there is no R Markdown option. But it has a lot of options for customization.
> epiDisplay::tab1(iris$Species, sort.group = T)
iris$Species :
           Frequency   %(NA+)   %(NA-)
setosa            47     31.3     32.2
versicolor        49     32.7     33.6
virginica         50     33.3     34.2
NA's               4      2.7      0.0
Total            150    100.0    100.0
Attribute                     Rating
----------------------------  ------
Input accepted                Vector
Number of Dependencies        4
Pretty Console Output         No
Pretty Knitted Output         No
Prints Total row              Yes
Prints Metadata               No
Sorts Results by Frequency    Yes
Allows Optional Weighting     No
Produces Accompanying Graph   Yes
Number of Decimals Printed    1
datawizard::data_tabulate
The last package I came across was datawizard::data_tabulate. This function produces decent-looking console output that renders as valid markdown as well. By default, NAs are printed, and it includes percents, valid percents, and cumulative percents. Options for customization on this one are minimal, though.
> datawizard::data_tabulate(iris$Species)
iris$Species <categorical>
# total N=150 valid N=146

Value      |  N | Raw % | Valid % | Cumulative %
-----------+----+-------+---------+-------------
setosa     | 47 | 31.33 |   32.19 |        32.19
versicolor | 49 | 32.67 |   33.56 |        65.75
virginica  | 50 | 33.33 |   34.25 |       100.00
<NA>       |  4 |  2.67 |    <NA> |         <NA>
Attribute                     Rating
----------------------------  ------
Input accepted                Vector
Number of Dependencies        3
Pretty Console Output         Yes
Pretty Knitted Output         Yes
Prints Total row              No
Prints Metadata               Yes
Sorts Results by Frequency    No
Allows Optional Weighting     No
Produces Accompanying Graph   No
Number of Decimals Printed    2
FreqR::freq
Finally, it seems only fair that I put my own (non-CRAN) package up to the same scrutiny I applied to the others. I must say, I like my console output. I appreciate the top and bottom dividers and the separation of the header from the table. FreqR::freq gives counts, percents, cumulative frequencies, and cumulative percents. By default I include NAs, but I don’t give valid percents, which may be a mistake. It also doesn’t render particularly well (or at all) in markdown.
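For completeness, the basic call mirrors the others reviewed here (my own package, installed from GitHub rather than CRAN):

FreqR::freq(iris$Species)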
Here then is my summary. If I had to go with one of these today (other than my own), it’d probably be either cleaner::freq or summarytools::freq. Both produce attractive output in markdown and at the console, and, through this exercise, it’s become clear to me that that is my number one requirement. But neither fits the bill in other ways. Neither produces a graph by default, a feature I really value. cleaner doesn’t show missing values by default, while summarytools is ‘heavy’ with 17 dependencies.
For me, this means I do, in fact, want to proceed with revising FreqR and submitting it to CRAN (almost certainly with a new name…I’m thinking “SimpleFreqs”). This exercise has definitely shown me some things I, personally, view as critical for a frequency table. I need the output to be pretty. I want it to produce an accompanying graph. Missing values should be shown by default, as should a totals row. Some of this my package currently does; some I’ll need to add. I’m excited to get started.
Actually, installing from GitHub is super easy, but it’s a tiny barrier, and that little bit of extra friction during code review isn’t worth it for my little package. ↩︎