sparklyr committers and public technical mailing list,
We are ready to kick off the sparklyr 1.5 release! As usual, please vote on the release branch at your earliest convenience and let me know if there is any objection to submitting sparklyr 1.5 to CRAN next Monday (Nov 23rd).
Unlike the previous 3 releases, this release will be more focused on improving existing sparklyr features rather than creating new ones. So far, some highlights from this release include the revamped non-arrow serialization routines, the addition of 4 new functions to the sdf_* family (2 of which were inspired by tidyr), plus many dplyr-related improvements and bug fixes:
- sparklyr's non-arrow serialization routines (previously based on the CSV file format) were replaced with new ones based on RDS format version 2, benefiting sparklyr in multiple ways:
  * The new serialization format, unlike CSV, enables binary data to be easily transported from R to Spark, and allows long-standing serialization headaches in sparklyr such as #2031 and #2763 to be resolved in conceptually simple ways.
  * Raw columns within an R dataframe can now be efficiently imported to Spark as binary columns (see the first sketch after this list).
  * `copy_to()` becomes at least 30% faster with RDS serialization. It is still nowhere near as fast as arrow-based serialization; however, the primary goal of rewriting the non-arrow serialization routines was improving correctness rather than performance, so any increase in performance was considered a bonus rather than a criterion of success.
  * The Spark-based parallel backend (aka 'doSpark') takes advantage of this new serialization format to execute `foreach` loops more efficiently (sketch below).
- Equivalents of `tidyr::unnest_wider()` and `tidyr::unnest_longer()` for Spark dataframes were implemented as `sdf_unnest_wider()` and `sdf_unnest_longer()` in sparklyr (sketch below).
- `sdf_partition_sizes()` was implemented to efficiently compute the partition sizes of a Spark dataframe (sketch below).
- `sdf_expand_grid()` now provides roughly the Spark equivalent of `expand.grid()` (sketch below).
- The subsetting operator (`[`) for selecting a subset of columns of a Spark dataframe will be supported in sparklyr 1.5. It is mostly useful within the context of dplyr verbs operating on multiple columns (sketch below).
- A number of dplyr-related improvements. The full list can be found here.
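
To make the above more concrete, here are a few quick sketches. They are illustrative only: the data, variable names, and the local connection are made up for demonstration, and exact behavior may vary by Spark version. First, a raw column traveling from R to Spark; under the new RDS-based routines it should arrive as a binary column instead of being forced through CSV:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# A list-of-raw column; with RDS-based serialization this should be
# imported into Spark as a binary column
df <- tibble::tibble(
  id = 1:2,
  payload = list(as.raw(c(0x01, 0x02)), serialize(mtcars, NULL))
)
sdf <- copy_to(sc, df, overwrite = TRUE)
```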
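Next, a minimal `foreach` loop running on the 'doSpark' backend; `registerDoSpark()` is the sparklyr entry point, and the loop body here is just a stand-in for real work:

```r
library(sparklyr)
library(foreach)

sc <- spark_connect(master = "local")
registerDoSpark(sc)  # make Spark the foreach parallel backend

# Each iteration is shipped to Spark as a task; results come back as a list
squares <- foreach(i = 1:10) %dopar% {
  i * i
}
```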
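A sketch of `sdf_unnest_wider()` on a struct column (`sdf_unnest_longer()` is analogous but emits one row per field instead of one column per field); the nested tibble below is a made-up example:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

sdf <- copy_to(sc, tibble::tibble(
  id = 1:3,
  attribute = list(
    list(name = "Alice", grade = "A"),
    list(name = "Bob", grade = "B"),
    list(name = "Carol", grade = "C")
  )
))

# Spread each field of the struct column into its own column
sdf %>% sdf_unnest_wider(attribute)
```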
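A sketch of `sdf_partition_sizes()`; it should return one row per partition with that partition's row count:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# 1000 rows spread across 5 partitions
sdf <- sdf_len(sc, 1000, repartition = 5)

sdf_partition_sizes(sdf)
```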
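A sketch of `sdf_expand_grid()`; like `expand.grid()`, it builds the Cartesian product of its inputs, but as a Spark dataframe:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# All combinations of var1 and var2, built as a Spark dataframe
grid_sdf <- sdf_expand_grid(sc, var1 = seq(4), var2 = c("foo", "bar"))
```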
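Finally, a sketch of `[` on a Spark dataframe; the column subset composes with dplyr verbs that act on many columns at once:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

sdf <- copy_to(sc, tibble::tibble(x = 1:5, y = 6:10, z = 11:15))

# Select a subset of columns with `[`, then apply a verb across all of them
sdf[c("x", "y")] %>% summarize_all(mean)
```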
You can also find a more detailed update about this release here.
Thanks!
Yitao