A typical jstats workflow

The Quick Start raced to a first result, and jstats with your own data covered getting files in and out. This page puts those pieces together and shows what an ordinary working session actually looks like — start to finish. The point isn’t any one command; it’s the order: how a real analysis moves from opening the data to a saved result, and where the everyday “working comfortably” tools fall into place along the way.

It runs on the shipped practice datasets, so you can follow along with nothing of your own: community for the main thread, with a short detour to its messy companion clinic for the one step that needs a dirty column. Swap in your own data wherever you like — the moves are the same.

You don’t need to understand every line below to get the shape of a session. Read it once straight through for the story — load, look, tidy, analyze, save — and don’t worry if a particular function is new. Each one has its own help page (type ? and the name, e.g. ?jdesc), and the earlier Get Started pages introduce the basics. The goal here is the arc, not mastery of every step.

Open the data and get your bearings

Every session starts by getting the data in front of you and taking a first look. The community data ships with the package, so it’s there by name the moment you run library(jstats) — no importing needed. (Opening a file of your own is covered in jstats with your own data.) juse() sets it as the session default, so later commands don’t need to be told which data frame to use every time.

juse(community)
Default data frame set to: community

juse() plays the role Stata’s use does: it names the active dataset once, and the analysis commands that follow assume it. In SPSS terms, it’s a bit like having a single data file open in Data View that every menu command then acts on — except you can load several data frames at once and switch which is the default whenever you like.

With the data loaded, jscreen() is the orientation pass: one call that lists every variable with its type and its count of unique values, then flags anything that looks off (odd values, missing data, outliers). stats = TRUE adds Mean and Median columns to the variable table.

jscreen(stats = TRUE)
Data Screening
Using default data frame: community
  Cases: 100 
  Variables: 15 
  Cases with missing data: 30 
  Variables with outliers: 0 

Variable Types
Variable        jstats Class  Sub-class   Unique Values       Mean   Median
--------------  ------------  ----------  -------------  ---------  -------
RespondentID    Categorical   identifier            100                    
Income          Numeric                              52  47414.894  48000.0
Education       Categorical   5-category              5                    
Age             Numeric                              41     40.660     40.0
WellbeingScore  Numeric                              43     50.600     51.0
Volunteer       Categorical   dichotomy               2      0.420         
OwnsHome        Categorical   dichotomy*              2      1.540         
Smoker          Categorical   dichotomy               2      0.421         
CommuteTime     Numeric                              46     31.340     30.5
Region          Categorical   4-category              4                    
Environment1    Categorical   Likert                  5                    
Environment2    Categorical   Likert                  5                    
Environment3    Categorical   Likert                  5                    
Environment4    Categorical   Likert                  5                    
Environment5    Categorical   Likert                  5                    
* coded other than 0/1; mean is not a proportion

Missing Data & Outliers (outliers > 3 SD from mean)
Variable      Missing  % Missing
------------  -------  ---------
Income              6        6.0
Education           6        6.0
Smoker              5        5.0
Environment1       12       12.0
Environment3       12       12.0

A screen like this is where you catch problems early — a variable that should be a 0/1 dichotomy showing values of 1 and 2, say, or a “score” whose minimum is an impossible negative number. That second one is exactly what we tackle next.
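The outlier rule named in the screen’s header (more than 3 SD from the mean) can be sketched in a few lines of base R. This is only an illustration on made-up numbers; whether jscreen() uses the sample standard deviation, or treats missing values the same way, is an assumption here.

```r
# Made-up values: mostly near 50, plus one implausible entry
x <- c(rep(50, 20), 49, 51, 52, 48, 200)

# Standardize, then flag anything more than 3 SD from the mean
z <- (x - mean(x)) / sd(x)
outliers <- x[abs(z) > 3]

outliers   # 200
```

A single extreme value also inflates the SD it is measured against, so a rule like this is a first-pass flag rather than a verdict.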

A column that arrived messy

community ships clean, which is convenient but not realistic — real data usually need some tidying before they can be trusted. So for this one step we detour to clinic, jstats’s deliberately messy companion dataset (also shipped, so it’s available by name too), clean a column there, and then return to community for the rest of the session.

juse(clinic)
Default data frame set to: clinic

The column MoodRating is meant to run from 1 to 10, but it arrived with two stand-in codes buried in it as ordinary numbers: -99 for “Refused” and -98 for “Don’t know” — the usual state of affairs after a plain CSV or Excel import, where the file carries no record of what those numbers mean. Until we tell jstats they stand for missing, they are treated as real values and quietly poison every statistic:

jdesc(MoodRating)
Descriptive Statistics
Using default data frame: clinic

Variable    Total  Non_missing  Min  Max    Mean      SD
----------  -----  -----------  ---  ---  ------  ------
MoodRating     70           70  -99    9  -4.943  31.477

The mean is dragged far below the 1–10 range, because those large negative codes are being averaged in as if they were genuine ratings. jdeclare_udm() fixes this the careful way — it declares the two codes as missing, with labels, without overwriting or deleting anything. The underlying data are untouched; we’ve only told jstats how to read them.

clinic <- jdeclare_udm(clinic, MoodRating,
                       codes = c("Refused" = -99, "Don't know" = -98))
Declared SPSS-style missing values in:
  clinic$MoodRating
  -99 ["Refused"]
  -98 ["Don't know"]

Assign the result to keep the declaration:
  clinic <- jdeclare_udm(clinic, MoodRating, ...)

To keep it across sessions, save the data frame:
  jsave(clinic, "clinic.rds")

jdesc(MoodRating)
Descriptive Statistics
Using default data frame: clinic

Case Processing  Excluded  Remaining
    Original            —         70
    Remaining N         —         70

────────────────────────────────────


Variable    Total  Non_missing  Min  Max  Mean     SD
----------  -----  -----------  ---  ---  ----  -----
MoodRating     70           63    1    9  5.46  1.702

Now the codes are set aside as missing and the mean sits sensibly inside the 1–10 range.

This is the equivalent of setting a variable’s missing values in SPSS’s Variable View (the MISSING VALUES command in syntax), or Stata’s extended missing values. The important parallel is that it’s non-destructive: like declaring missing values rather than recoding them to system-missing, jdeclare_udm() leaves the original codes in place and simply flags them, so nothing is lost and the declaration can be changed or undone later.
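To see why undeclared codes do so much damage, and what “non-destructive” means in practice, here is a minimal base R sketch. It does not show how jdeclare_udm() works internally; the ratings are invented, and the masked copy merely stands in for the idea of flagging codes without overwriting them.

```r
# Hypothetical 1-10 ratings with two stand-in codes mixed in
mood <- c(5, 7, -99, 6, 4, -98, 8)
missing_codes <- c("Refused" = -99, "Don't know" = -98)

# Naive mean: the codes are averaged in as if they were ratings
naive_mean <- mean(mood)

# Masked view: codes read as NA, original vector left untouched
mood_valid <- ifelse(mood %in% missing_codes, NA, mood)
clean_mean <- mean(mood_valid, na.rm = TRUE)

clean_mean   # 6
mood[3]      # still -99: nothing was overwritten
```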

Declaring missing values is a small move with a lot behind it — different programs (SPSS, Stata, SAS) record missing data in different ways, and jstats keeps those distinctions straight when you read and write files. jstats with your own data covers the import/export side, and Book 2 covers the cross-program fidelity in more detail, with worked examples. The help page ?jdeclare_udm documents the options.

juse(community)
Default data frame set to: community

With the detour done, community is the session default again.

Build a scale

A common early task is combining several survey items into one scale score — but only after checking the items hang together. community has a five-item Environment battery (Environment1 through Environment5).

By default jstats prints variable names, which keeps output uncluttered. But a variable’s label, when it has one, usually carries the actual question wording — and that can be a useful clue when something looks off. So before we start, let’s switch the display to bring the labels along. joutput() sets output preferences for the rest of the session; variable.id = "legend" keeps the short names in the tables and adds a label legend beneath them:

joutput(variable.id = "legend")
Output Settings
Level: standard
  effect.size: ON
  regression.ci: OFF
  means.ci: ON
  levene: OFF
  posthoc: OFF
  diagnostics: OFF
  case.processing: AUTO
  case.processing.detail: TOTALS
  variable.id: LEGEND (override)
  value.id: BOTH
  ref.categories: ON
  udm.notice: AUTO
  digits: 3

Running joutput() prints its full settings panel — every output setting, not only the one you changed — so the screenful you see here is expected, not a sign anything went wrong. The setting you changed, variable.id, is now marked (override), flagging that it differs from the default; the others stay at their usual values.

You can also set this for a single call rather than session-wide, by passing variable.id = "legend" directly to the function you’re running.

variable.id is just one of several display settings joutput() controls. ?joutput lists them all, and the Reference page links every function to its own help. We won’t work through every setting here — the guides are a starting point, not a complete manual — so for the output options covered in more detail, with worked examples, see Book 2.

Now jalpha() reports Cronbach’s alpha and the per-item diagnostics, with each item’s label shown in the legend below:

jalpha(Environment1, Environment2, Environment3, Environment4, Environment5)
Reliability Analysis
Using default data frame: community

Case Processing    Excluded  Remaining
    Original              —        100
    Auto-listwise        18         82
    Analysis N            —         82

Missing-data breakdown  From 100     %
    Environment1
      Missing              12     12.0
    Environment3
      Missing              12     12.0

──────────────────────────────────────

Reliability Statistics
Cronbach's Alpha  N of Items
----------------  ----------
           0.297           5

Item Statistics
Item           Mean     SD   N
------------  -----  -----  --
Environment1  2.988  1.212  82
Environment2  2.780  1.228  82
Environment3  3.134  1.163  82
Environment4  3.098  1.203  82
Environment5  2.976  1.474  82
Warning: The following item(s) are negatively correlated with the rest of the scale: Environment2.
They may need reverse-coding, or may not belong in the scale - check the item-total table and the item wording.

Item-Total Statistics
Item          Corrected Item-Total r  Alpha if Item Deleted
------------  ----------------------  ---------------------
Environment1                   0.505                 -0.115
Environment2                  -0.615                  0.749
Environment3                   0.536                 -0.129
Environment4                   0.365                  0.040
Environment5                   0.340                  0.012

Variable Labels:
  Environment1 = Climate change is a serious threat.
  Environment2 = Concern about the environment is exaggerated. R
  Environment3 = Government should do more for the environment.
  Environment4 = I would pay more for environmentally friendly products.
  Environment5 = Pollution is a major cause of public health problems.

jalpha() flags Environment2 as a likely reverse-keyed item: it correlates negatively with the rest of the scale, which drags the reliability down. The label legend tells you why: Environment2’s label ends in “R”, the marker this dataset’s author used for a reverse-coded item. Your own data won’t always be flagged so conveniently, but the label wording is often the tell on its own: an item phrased in the opposite direction from its neighbors (here, a skeptical statement among four pro-environment ones) is exactly the kind of thing that surfaces here. We fix it by recoding the item into a new column that runs the same direction as the others. jrecode() returns the recoded variable, which we store as a new column:

community$Environment2R <- jrecode(community, Environment2,
                                   map = "1=5; 2=4; 3=3; 4=2; 5=1")

Note: jrecode() returns the recoded values; assign them to a column to keep them:
  community$<name> <- jrecode(...)
To check the recode landed correctly, compare jfreq() on the original and the new column.

That map = "1=5; 2=4; ..." is the same idea as SPSS’s RECODE Environment2 (1=5)(2=4)(3=3)(4=2)(5=1) INTO Environment2R. Storing the result in a new column (Environment2R) rather than overwriting the original is the safe habit — the original item stays intact in case you need it.
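For readers who think in base R, the same reversal can be sketched without jstats. The responses below are invented, and this is not a description of how jrecode() is implemented; it only shows that the 1=5 … 5=1 map on a 1–5 item is the same as subtracting from 6, and that missing responses stay missing.

```r
# Hypothetical responses on a 1-5 agreement scale, one missing
item <- c(1, 2, 3, 4, 5, NA, 2)

# Explicit lookup table, mirroring the recode map
recode_map <- c("1" = 5, "2" = 4, "3" = 3, "4" = 2, "5" = 1)
item_r <- unname(recode_map[as.character(item)])

# Arithmetic shortcut for a 1-5 scale: (min + max) - value
item_r_short <- 6 - item

identical(item_r, item_r_short)   # TRUE
```

The lookup form generalizes to uneven recodes; the subtraction form only works for a straight reversal.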

Run jalpha() again with the reversed item in place of the original:

jalpha(Environment1, Environment2R, Environment3, Environment4, Environment5)
Reliability Analysis
Using default data frame: community

Case Processing    Excluded  Remaining
    Original              —        100
    Auto-listwise        18         82
    Analysis N            —         82

Missing-data breakdown  From 100     %
    Environment1
      Missing              12     12.0
    Environment3
      Missing              12     12.0

──────────────────────────────────────

Reliability Statistics
Cronbach's Alpha  N of Items
----------------  ----------
           0.798           5

Item Statistics
Item            Mean     SD   N
-------------  -----  -----  --
Environment1   2.988  1.212  82
Environment2R  3.220  1.228  82
Environment3   3.134  1.163  82
Environment4   3.098  1.203  82
Environment5   2.976  1.474  82

Item-Total Statistics
Item           Corrected Item-Total r  Alpha if Item Deleted
-------------  ----------------------  ---------------------
Environment1                    0.631                  0.744
Environment2R                   0.615                  0.749
Environment3                    0.728                  0.716
Environment4                    0.601                  0.754
Environment5                    0.387                  0.832

Variable Labels:
  Environment1  = Climate change is a serious threat.
  Environment2R = Concern about the environment is exaggerated. R (recoded)
  Environment3  = Government should do more for the environment.
  Environment4  = I would pay more for environmentally friendly products.
  Environment5  = Pollution is a major cause of public health problems.

With the item reversed, alpha rises from 0.297 to 0.798, but the diagnostics now point at Environment5: it has the weakest corrected item-total correlation (0.387) and the highest “Alpha if Item Deleted” (0.832), meaning the scale would be a touch more reliable without it. So we drop it and build the scale from the four items that hang together, as a per-case mean with javg(), requiring at least three of the four to be present so that a respondent who skipped one item still gets a score:

community$EnvScale <- javg(Environment1, Environment2R, Environment3, Environment4,
                           min.valid = 3, var.label = "Environment scale (mean)")
Mean of 4 variables computed for 100 cases (min.valid = 3: 12 cases used partial data, 6 set to NA due to missing values).
Mean of the new variable: 3.131.

Note: javg() returns the scores; assign them to a column to keep them:
  community$<name> <- javg(...)
For the full distribution (min, max, SD), run jdesc() on the new column.

jdesc(EnvScale)
Descriptive Statistics
Using default data frame: community

Case Processing  Excluded  Remaining
    Original            —        100
    Remaining N         —        100

────────────────────────────────────


Variable  Total  Non_missing  Min  Max   Mean     SD
--------  -----  -----------  ---  ---  -----  -----
EnvScale    100           94    1    5  3.131  0.972

Variable Labels:
  EnvScale = Environment scale (mean)
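The min.valid idea, a per-case mean that is only reported when enough items were answered, can be sketched in base R as follows. The helper name row_mean_min_valid() and the toy data are hypothetical, and javg() may differ in its details.

```r
# Hypothetical helper: row mean, set to NA when too few items are present
row_mean_min_valid <- function(items, min_valid) {
  n_valid <- rowSums(!is.na(items))          # answered items per case
  scores <- rowMeans(items, na.rm = TRUE)    # mean of the answered items
  scores[n_valid < min_valid] <- NA          # too sparse: no score
  scores
}

toy <- data.frame(
  a = c(4, 2, NA, NA),
  b = c(5, NA, 3, NA),
  c = c(3, 2, 1, 2),
  d = c(4, 4, 2, NA)
)

scores <- row_mean_min_valid(toy, min_valid = 3)
# Cases 2 and 3 skipped one item and are scored on the other three;
# case 4 answered only one item and gets NA.
```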

Book 1 covers what Cronbach’s alpha measures and why a reverse-keyed item has to be flipped before it’s trusted; Book 2 covers jalpha()’s and javg()’s options — the min.valid cutoff, alternative reliability handling — in more detail, with worked examples. The help pages ?jalpha, ?jrecode, and ?javg document them as well.
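As a companion to the output above, here is a small base R sketch of the standard Cronbach’s alpha formula, alpha = k/(k-1) × (1 − sum of item variances / variance of the total score), applied to invented data. The function name cronbach_alpha() is hypothetical, and jalpha() may handle missing data and edge cases differently; the sketch only shows why one item keyed the wrong way can wreck the coefficient.

```r
# Hypothetical helper: Cronbach's alpha with listwise deletion
cronbach_alpha <- function(items) {
  items <- items[complete.cases(items), , drop = FALSE]
  k <- ncol(items)
  item_vars <- sapply(items, var)       # variance of each item
  total_var <- var(rowSums(items))      # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Made-up responses; q3 is keyed in the opposite direction
toy <- data.frame(
  q1 = c(1, 2, 3, 4, 5, 4),
  q2 = c(2, 2, 3, 5, 5, 3),
  q3 = c(5, 4, 3, 2, 1, 2)
)

alpha_raw <- cronbach_alpha(toy)     # negative: q3 fights the other items

toy$q3 <- 6 - toy$q3                 # reverse-code the 1-5 item
alpha_fixed <- cronbach_alpha(toy)   # much higher once all items agree
```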

Run the analysis

With a clean, prepared dataset, the analysis itself is short. jstats uses the same DV ~ IV formula you’d write in base R, so a linear regression of wellbeing on income and age reads naturally:

jlm(WellbeingScore ~ Income + Age)
Linear Regression
Using default data frame: community

Case Processing    Excluded  Remaining
    Original              —        100
    Auto-listwise         6         94
    Analysis N            —         94

Missing-data breakdown  From 100    %
    Income
      Missing              6      6.0

──────────────────────────────────────


Coefficients
               b      SE      t      β      p  
-----------  ------  -----  -----  -----  -----
(Intercept)  29.287  3.610  8.113         <.001
Income        0.000  0.000  6.416  0.549  <.001
Age           0.170  0.083  2.060  0.176   .042

Outcome:
  WellbeingScore = Wellbeing score (0-100)
Predictors:
  Income         = Annual income (USD)
  Age            = Age (years)

R-squared: 0.387    Adjusted R-squared: 0.373
Residual Standard Error: 8.925

F-statistic: 28.707 on 2 and 91 DF, p-value: <.001
Sum of Squares:
  Regression: 4573.217
  Residual:   7248.527
  Total:      11821.745
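The summary lines at the bottom of that output all follow from the sums of squares and the degrees of freedom. As a quick arithmetic check in base R, using only numbers copied from the table above and the standard textbook formulas:

```r
# Values copied from the regression output above
ss_reg <- 4573.217
ss_res <- 7248.527
df_reg <- 2
df_res <- 91

r2     <- ss_reg / (ss_reg + ss_res)                    # R-squared
adj_r2 <- 1 - (1 - r2) * (df_reg + df_res) / df_res     # adjusted R-squared
f_stat <- (ss_reg / df_reg) / (ss_res / df_res)         # F-statistic
rse    <- sqrt(ss_res / df_res)                         # residual standard error

round(c(r2 = r2, adj_r2 = adj_r2, f = f_stat, rse = rse), 3)
```

The rounded values match the printed R-squared (0.387), adjusted R-squared (0.373), F-statistic (28.707), and residual standard error (8.925).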

To add a categorical predictor like Region, you tell jstats to treat it as a set of dummy variables. jdummy() registers that once — choosing West as the reference category here — and every analysis that follows honors it, so you don’t re-specify it each time:

jdummy(Region, ref = "West")
Dummy Variable Registration
Using default data frame: community

  Variable: Region (haven_labelled)
  Reference category: Region_West
  Dummy variables: Region_North, Region_South, Region_East
  Cases: 100 (0 missing)
Note: this registration is stored for this session only.
To keep it across sessions, save the data frame in R format (.rds):
  jsave(community, "community.rds")

Next session, load that file to restore the registration:
  community <- jload("community.rds")

jlm(WellbeingScore ~ Income + Age + Region)
Linear Regression
Using default data frame: community

Case Processing    Excluded  Remaining
    Original              —        100
    Auto-listwise         6         94
    Analysis N            —         94

Missing-data breakdown  From 100    %
    Income
      Missing              6      6.0

──────────────────────────────────────


Coefficients
                    b      SE      t      β      p  
----------------  ------  -----  -----  -----  -----
(Intercept)       23.045  4.077  5.653         <.001
Income             0.000  0.000  6.698  0.559  <.001
Age                0.215  0.082  2.620  0.222   .010
Region_North (1)   5.391  2.466  2.186          .031
Region_South (1)   7.323  2.770  2.644          .010
Region_East (1)    5.276  2.356  2.239          .028

Outcome:
  WellbeingScore = Wellbeing score (0-100)
Predictors:
  Income         = Annual income (USD)
  Age            = Age (years)
  Region         = Region of residence

R-squared: 0.444    Adjusted R-squared: 0.412
Residual Standard Error: 8.642

F-statistic: 14.058 on 5 and 88 DF, p-value: <.001
Sum of Squares:
  Regression: 5249.590
  Residual:   6572.155
  Total:      11821.745

jdummy() is doing what you’d otherwise arrange with factor() and contrasts() / relevel(). The difference is that the registration is non-destructive and lasts for the rest of the session (or longer, if you save the data frame as .rds): Region stays a plain labelled variable in the data, and the dummy treatment lives in a separate registration that jlm(), jlogistic(), and the others look up. You set the reference category in one place, not inside every model call.
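For comparison, this is roughly what the base R route looks like, on a small invented data frame. It is a sketch of the factor() / relevel() approach mentioned above, not of jdummy() itself; note that here the column itself is converted to a factor, which is exactly the step the registration avoids.

```r
# Made-up data: an outcome and a four-level region
toy <- data.frame(
  y = c(10, 12, 11, 15, 14, 16, 20, 19),
  region = c("North", "North", "South", "South",
             "East", "East", "West", "West")
)

# Convert to a factor and make West the reference level
toy$region <- relevel(factor(toy$region), ref = "West")

fit <- lm(y ~ region, data = toy)
coef(fit)
# The intercept is the West mean (19.5); each region coefficient is
# that region's mean minus the West mean.
```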

Logistic regression works the same way, with a 0/1 outcome. Volunteer is already coded 0/1, so it drops straight in:

jlogistic(Volunteer ~ Age)
Logistic Regression
Using default data frame: community

Coefficients
               b      SE    Wald   df   p    Exp(B)
-----------  ------  -----  -----  --  ----  ------
(Intercept)  -1.923  0.788  5.954   1  .015   0.146
Age           0.039  0.018  4.488   1  .034   1.040

Outcome:
  Volunteer = Volunteered in past year
Predictors:
  Age       = Age (years)

Omnibus Test of Model Coefficients
Chi-Square  df  p   
----------  --  ----
     4.761   1  .029

Model Summary
-2 Log Likelihood  Cox & Snell R²  Nagelkerke R²      AIC
-----------------  --------------  -------------  -------
          131.297           0.046          0.063  135.297

Dependent Variable Encoding
  Modeled (1):   Yes
  Reference (0): No
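The Exp(B) column is the exponentiated coefficient, which reads as an odds ratio. A short base R check, using only the Age coefficient printed above, shows the conversion and how to restate it for a ten-year difference (the ten-year framing is just an illustration):

```r
b_age <- 0.039                       # Age coefficient from the table above

or_per_year   <- exp(b_age)          # odds ratio for one extra year
or_per_decade <- exp(10 * b_age)     # odds ratio for ten extra years

round(or_per_year, 3)                # 1.04, matching Exp(B)
```

Because the table rounds b to three decimals, figures derived from it are approximate.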

A dichotomy stored as 1/2 (the common Yes/No convention) needs recoding to 0/1 first — jstats will stop with a message telling you exactly that, rather than running something misleading. The recode is a one-liner with jrecode(), the same tool we used on the scale item; jstats with your own data shows that step in context.
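In base R terms, the 1/2 to 0/1 shift is a subtraction, and it is also what makes the mean readable as a proportion (the point of the footnote on OwnsHome in the screening table). The values and the meaning of each code below are hypothetical.

```r
# Hypothetical coding: 1 = No, 2 = Yes, with one missing response
owns_home_12 <- c(1, 2, 2, 1, 2, NA)

# Shift to 0/1 so that 1 means Yes
owns_home_01 <- owns_home_12 - 1

mean(owns_home_12, na.rm = TRUE)   # 1.6: not a proportion
mean(owns_home_01, na.rm = TRUE)   # 0.6: share of Yes among those who answered
```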

Book 1 covers the regression ideas themselves — what the coefficients mean, how dummy variables encode a categorical predictor; Book 2 covers jlm() and jlogistic()’s options and the reference-category choice in more detail, with worked examples. ?jlm, ?jlogistic, and ?jdummy document the options too.

Keep one set of cases across analyses

When several variables have missing values, different analyses can quietly run on different subsets of respondents — a regression on these three variables uses a different N than a correlation on those four. When you want a run of analyses to share the same respondents, jcomplete() sets a listwise filter: only cases complete on the variables you name are used, until you turn it off.

jcomplete(Income, Education, Age)
Listwise Case Filter
Using default data frame: community

Variable     N  Missing  % Missing
---------  ---  -------  ---------
Income     100        6  6.0%     
Education  100        6  6.0%     
Age        100        0  0.0%     

  Complete cases: 88 of 100 (88.0%)
  Listwise filter activated — 12 cases will be excluded from subsequent analyses.
jdesc(Age)
Descriptive Statistics
Using default data frame: community

Case Processing  Excluded  Remaining
    Original            —        100
    jcomplete          12         88  Income, Education, +1 more
    Remaining N         —         88

────────────────────────────────────────────────────────────────


Variable  Total  Non_missing  Min  Max    Mean      SD
--------  -----  -----------  ---  ---  ------  ------
Age          88           88   18   71  41.125  11.794

Variable Labels:
  Age = Age (years)

Every analysis you run now uses just that matching set of cases, so their N’s line up and the results are comparable. (jcomplete() can also preview exactly which cases it would drop before you commit — ?jcomplete shows how.)
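The underlying idea is ordinary listwise deletion. In base R it can be sketched with complete.cases() on invented data; unlike jcomplete(), this creates a filtered copy rather than a filter that later commands pick up on their own.

```r
# Made-up data with scattered missing values
toy <- data.frame(
  income    = c(40, NA, 55, 60, 48),
  education = c(3, 2, NA, 4, 1),
  age       = c(25, 31, 47, 52, 38)
)

# TRUE for rows complete on all three named variables
keep <- complete.cases(toy[, c("income", "education", "age")])
shared <- toy[keep, ]

nrow(shared)   # 3: every analysis on `shared` uses the same three cases
```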

A clean slate when you want one

The session default, the dummy registration, and the listwise filter all persist by design — that’s what makes them convenient. When you do want to clear them (starting a genuinely separate analysis, or undoing something), each has an off switch:

jcomplete(NULL)            # drop the listwise filter
jcomplete cleared for community (had: Income, Education, Age).
jdummy(community, NULL)    # drop the dummy registrations
Dummy registrations cleared for community: Region.
juse(NULL)                 # clear the session default
Default data frame cleared.

You won’t normally clear these between ordinary analyses; leaving a default and your registrations in place is the point of having them. Clearing is for a fresh start, or for undoing a mistake.

Dialing the detail up or down

Everything you’ve seen this session came out at jstats’s standard detail level — the default. joutput() also takes an overall level, so you can trim the output or expand it in a single move:

joutput("minimal")    # core results only — leanest output
joutput("standard")   # the default you've been seeing
joutput("full")       # fullest detail — diagnostics, extra tests, complete processing summaries

minimal strips the output down to the essential result tables — handy for a clean report or a production script; full turns on the additional diagnostics and notes that standard leaves off. Important warnings and errors always appear, whatever level you choose.

Each level is a preset for a whole set of individual switches — effect sizes, confidence intervals, the Case Processing Summary, and more — and you can override any one of them on its own, as you did earlier with variable.id. ?joutput lists every setting and what each level turns on; Book 2 covers output control in more detail, with worked examples.

Save your work, and open a file someone sends you

A finished session usually ends by saving the prepared data. jsave() writes the in-memory data frame back out to a file, and jload() reads it back; together they’re the round trip. Across formats — R’s own .rds, plus SPSS, Stata, and Excel — jstats carries value labels, variable labels, and missing-value declarations along as faithfully as each format allows.

jsave(community, "community.rds")         # save the prepared data
jload("community.rds", overwrite = TRUE)  # read it back

Saving across formats is where the missing-value distinctions earn their keep — an SPSS-style code can’t write straight to a Stata file, and Excel stores no missing-value metadata at all, so jstats handles (and warns about) each case. jstats with your own data walks the full cross-format round trip live; Book 2 covers the fidelity rules in more detail, with worked examples. ?jsave and ?jconvert document the options.
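Why the format matters can be seen with base R alone. In this sketch a made-up variable label is stored as an attribute; it survives an .rds round trip but not a CSV one. It uses saveRDS() and write.csv() directly and says nothing about how jsave() is implemented.

```r
# A tiny data frame with a label stored as an attribute
labelled <- data.frame(score = c(1, 2, 3))
attr(labelled$score, "label") <- "Hypothetical score label"

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(labelled, rds_path)                        # R's native format
write.csv(labelled, csv_path, row.names = FALSE)   # plain text

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$score, "label")   # the label comes back
attr(from_csv$score, "label")   # NULL: CSV has nowhere to keep it
```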

Everything above used files you’d save yourself. The package also ships an example SPSS file inside itself — a stand-in for a data file a colleague emails you. system.file() is base R’s way of finding a file bundled inside an installed package; jload() then opens it like any other received file, missing-value codes and all:

received <- system.file("extdata", "community_spss.sav", package = "jstats")
jload(received, name = "community_received", overwrite = TRUE)
Loaded community_received (SPSS format; 100 cases, 15 variables)
5 variables have user-defined missing values:
  Income: -99 ["Refused"], -98 ["Don't know"]
  Education: -99 ["Refused"], -98 ["Don't know"]
  Smoker: -99 ["Refused"]
  Environment1: -99 ["Refused"], -98 ["Don't know"]
  Environment3: -99 ["Refused"], -98 ["Don't know"]
These codes are excluded as missing in jstats analyses. For better base R compatibility, convert them:
  jconvert(community_received, to = "stata")  - retains missing-value codes, base R compatible (recommended)
  jconvert(community_received, to = "baseR")  - converts to plain NA and removes missing-value codes

jscreen(community_received, stats = TRUE)
Data Screening
  Cases: 100 
  Variables: 15 
  Cases with missing data: 30 
  Variables with outliers: 0 

Variable Types
Variable        jstats Class  Sub-class   Unique Values       Mean   Median
--------------  ------------  ----------  -------------  ---------  -------
RespondentID    Categorical   identifier            100                    
Income          Numeric                              52  47414.894  48000.0
Education       Categorical   5-category              5                    
Age             Numeric                              41     40.660     40.0
WellbeingScore  Numeric                              43     50.600     51.0
Volunteer       Categorical   dichotomy               2      0.420         
OwnsHome        Categorical   dichotomy*              2      1.540         
Smoker          Categorical   dichotomy               2      0.421         
CommuteTime     Numeric                              46     31.340     30.5
Region          Categorical   4-category              4                    
Environment1    Categorical   Likert                  5                    
Environment2    Categorical   Likert                  5                    
Environment3    Categorical   Likert                  5                    
Environment4    Categorical   Likert                  5                    
Environment5    Categorical   Likert                  5                    
* coded other than 0/1; mean is not a proportion

Missing Data & Outliers (outliers > 3 SD from mean)
Variable      Missing  % Missing
------------  -------  ---------
Income              6        6.0
Education           6        6.0
Smoker              5        5.0
Environment1       12       12.0
Environment3       12       12.0

Variable Labels:
  RespondentID   = Respondent ID
  Income         = Annual income (USD)
  Education      = Highest education level
  Age            = Age (years)
  WellbeingScore = Wellbeing score (0-100)
  Volunteer      = Volunteered in past year
  OwnsHome       = Owns home
  Smoker         = Current smoker
  CommuteTime    = Daily commute time (minutes)
  Region         = Region of residence
  Environment1   = Climate change is a serious threat.
  Environment2   = Concern about the environment is exaggerated. R
  Environment3   = Government should do more for the environment.
  Environment4   = I would pay more for environmentally friendly products.
  Environment5   = Pollution is a major cause of public health problems.

And that’s a full session, end to end: we opened the data, screened it, cleaned a messy column, built and checked a scale, ran the models, kept the case base consistent, and saved the result, with the everyday conveniences (juse(), jdummy(), jcomplete()) falling into place as they came up rather than as separate lessons.

This page is the narrative of a session, not a manual. To keep it moving, it left a good deal unexplained, including output you saw but we didn’t stop on, like the Case Processing Summary that headed several of the analyses (the small table reporting how many cases an analysis used and how missing values were handled). For looking a single task up while you work (“how do I reverse-code an item,” “how do I get a correlation matrix”), see the Reference page, which links every function to its full help and its options. And to go further than these guides, the book series covers the analyses and their output in full: Book 1 builds the statistics and R together from the start, and Book 2 is the option-by-option guide (output settings, the Case Processing Summary, and the rest) for readers who already know the statistics.