This repository contains mdbplyr, an R package that
provides a disciplined, lazy dplyr-style interface for
MongoDB aggregation pipelines.
mdbplyr sits between raw mongolite usage
and broader compatibility layers. Compared with writing aggregation JSON
by hand, it lets you express supported queries with tidy verbs and
inspect the generated pipeline before execution. Compared with
approaches that try to hide MongoDB behind table-like semantics, it
stays explicit about scope, keeps translation conservative, and fails
clearly when a verb or expression is outside the supported subset.
The practical advantages are:
collect(),cursor(),show_query() for inspectable pipeline output,Suppose there is a running MongoDB instance on localhost
with default port and no authentication. The code below loads
dplyr::starwars into a collection named
starwars.
library(dplyr)
library(mongolite)
library(mdbplyr)
starwars_collection <- mongolite::mongo(
collection = "starwars",
db = "mdbplyr"
)
starwars_collection$drop()
starwars_collection$insert(dplyr::starwars)
starwars_tbl <- tbl_mongo(
starwars_collection,
schema = names(dplyr::starwars)
)Once the collection is loaded, starwars_tbl is the lazy
table used in the examples below.
mdbplyr uses a schema to know which fields are available
in a collection. This matters especially for:
`message.timestamp` or
`message.measurements.Fx`;filter(), mutate(), and
summarise();select(),
flatten_fields(), and unwind_array().The most reliable approach is to pass schema = ...
explicitly when creating the lazy table:
starwars_tbl <- tbl_mongo(
starwars_collection,
schema = c("name", "species", "height", "mass", "homeworld")
)When writing the schema by hand is inconvenient,
infer_schema() can populate it from the first document in
the collection:
This is convenient for exploratory work, but it has an important limitation: it only sees one document. If the collection is heterogeneous, fields that do not appear in the first document may still need to be added manually.
You can inspect the currently known fields with:
Inspect the known schema and the generated pipeline without executing the query.
Each subsection below shows one of the supported
dplyr-like verbs on the starwars
collection.
select()Selecting dotted paths preserves nested MongoDB structure by default. It does not flatten nested fields unless you explicitly ask for that:
mutate()arrange()summarise()flatten_fields()Use flatten_fields() when you explicitly want nested
object leaves to become flat tibble columns. By default the output names
are the schema dot paths.
You can also target a specific nested root and optionally rename the flattened output columns:
unwind_array()Use unwind_array() when a document field contains an
array and you want one output row per array element.
If array elements are nested objects, unwind_array() and
flatten_fields() can be chained:
The examples above stay within the currently supported subset:
.data$... for
explicit field references and .env$... for local
values,mutate() and
transmute(), plus the special 1:n()
row-numbering case,flatten_fields(),unwind_array(),If a query falls outside that subset, mdbplyr is
designed to fail explicitly rather than guess or silently change
execution semantics.