Pydio Cells v3.0 - Accelerated Performance
Cells V3 comes with tons of improvements for an even more streamlined user experience. One of the main focuses for our development team was server speed and performance under the hood. This article will look at the new datasource format we implemented, improvements to internal communication between services, and caching optimizations.
1 - New “Flat” Datasources
Pydio users often ask for the ability to "keep the tree structure" of files visible on the storage. To achieve this, Cells' datasources historically relied on unidirectional synchronization between the storage and the internal index. While this allowed files to be modified directly without going through Pydio, it also brought its share of issues and performance limitations with large numbers of files – so much so that a "resync" of the datasource was sometimes required to fix index issues.
The “structured data” flow:
Cells V3 introduces a new datasource format that keeps the tree structure only in Cells’ internal indexes and stores files as a flat structure on the storage. This is more in line with the "object storage" design and brings huge performance gains, as a "sync" is no longer required to maintain the indexes. Where structured datasources used to need to wait for storage events to update indexes, flat datasources now directly update indexes at upload/modification.
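The core idea can be sketched in a few lines of Go: the logical path lives only in the index, while the storage key is a flat, opaque identifier assigned at upload time. This is a conceptual illustration, not Cells' actual code; the types and helper names are invented for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// IndexEntry ties a logical tree path (kept only in the index)
// to a flat object key (the only thing the storage sees).
type IndexEntry struct {
	Path      string // logical path shown to users
	ObjectKey string // flat key on the object storage
}

// newObjectKey generates a random flat key, standing in for the
// UUID-like identifiers a flat datasource would use.
func newObjectKey() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// FlatIndex is a toy index keyed by logical path.
type FlatIndex struct {
	entries map[string]IndexEntry
}

func NewFlatIndex() *FlatIndex {
	return &FlatIndex{entries: map[string]IndexEntry{}}
}

// Put registers a file: the index is updated directly at upload time,
// with no storage-event synchronization required.
func (ix *FlatIndex) Put(path string) IndexEntry {
	e := IndexEntry{Path: path, ObjectKey: newObjectKey()}
	ix.entries[path] = e
	return e
}

func main() {
	ix := NewFlatIndex()
	e := ix.Put("/docs/report.pdf")
	fmt.Println(e.Path, "->", e.ObjectKey)
}
```

Because the index is the single source of truth for the tree, there is nothing on the storage side to resynchronize.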
The “flat data” flow:
This allows files to appear more quickly in the UI and, above all, fixes the issue of moving/renaming huge folders. Instead of reacting to storage events and applying complex algorithms to detect data changes, a move/rename is just a matter of updating the index, while the data itself is left untouched. In practice, this speeds up move/rename operations (inside a datasource) by a factor of 10 to 100, depending on the folder size!
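To make this concrete, here is a minimal Go sketch of an index-only folder move: renaming a folder rewrites path prefixes in the index, while the flat storage keys (and the objects behind them) never change. The `moveFolder` helper and the map-based index are assumptions made for illustration, not Cells' real data model.

```go
package main

import (
	"fmt"
	"strings"
)

// moveFolder renames a folder by rewriting path prefixes in the index.
// The values (flat storage keys) are untouched: no data is copied.
// Deleting during a Go map range is safe, and any entries added during
// the range already carry the new prefix, so they are never re-matched.
func moveFolder(index map[string]string, oldPrefix, newPrefix string) {
	for path, key := range index {
		if strings.HasPrefix(path, oldPrefix) {
			delete(index, path)
			index[newPrefix+strings.TrimPrefix(path, oldPrefix)] = key
		}
	}
}

func main() {
	index := map[string]string{
		"/projects/a/readme.md": "3fa2c1",
		"/projects/a/spec.pdf":  "9bc1d4",
	}
	moveFolder(index, "/projects/a", "/projects/b")
	fmt.Println(index)
}
```

The cost of the operation is proportional to the number of index rows, not to the size of the data, which is why the speedup grows with folder size.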
2 - Faster Internal Communication
At its core, Cells is designed with a “microservice” architecture, splitting all domain-specific features into independent services that communicate with each other via APIs. This is a great way to provide stability (as long as the API “contract” is honored, everything works as expected, even if the underlying implementation is totally rewritten) and scalability (each microservice can be distributed or replicated on multiple servers).
This type of architecture requires a "message bus" to convey all event-based and request/response communication between services. Rather than reinvent the wheel, Cells relies on NATS.io technology to handle multi-node deployments. But in many cases Cells is deployed on a single-node server, and our development team found that the NATS network layer could be skipped entirely when Cells runs on a single machine.
Adding an "in-memory" communication layer simplifies communication, improves performance dramatically, and eliminates the need for NATS on single-node deployments. The internal web proxy also benefits from its own internal DNS resolver, which avoids the repeated "caddy restarting…" messages previously seen at startup while services were still coming online.
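The principle behind the in-memory layer can be sketched as an in-process publish/subscribe bus built on Go channels: messages are delivered directly between services in the same process, with no network hop or broker. This is a conceptual sketch under our own naming (`Bus`, `Subscribe`, `Publish`), not Cells' actual internal API.

```go
package main

import (
	"fmt"
	"sync"
)

// Bus is a minimal in-process publish/subscribe layer, illustrating
// how a network broker can be skipped on single-node deployments.
type Bus struct {
	mu   sync.RWMutex
	subs map[string][]chan string
}

func NewBus() *Bus {
	return &Bus{subs: map[string][]chan string{}}
}

// Subscribe returns a buffered channel that receives every message
// published on the given topic.
func (b *Bus) Subscribe(topic string) <-chan string {
	ch := make(chan string, 8)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers msg to all subscribers of topic, entirely in memory.
func (b *Bus) Publish(topic, msg string) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[topic] {
		ch <- msg
	}
}

func main() {
	bus := NewBus()
	events := bus.Subscribe("node.changed")
	bus.Publish("node.changed", "/docs/report.pdf updated")
	fmt.Println(<-events)
}
```

Because delivery is a channel send rather than a TCP round-trip, both serialization and network latency disappear from the hot path.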
The biggest performance gain can be seen in Cells' start-up/restart time, which has decreased from an average of 20s to 8s.
3 - Caching to Improve Response Times
The team also introduced a number of new caching mechanisms that improve response time on many frequently used requests. Typically, user Access Control Lists have to be checked on each request to ensure users have the proper access permissions. These ACLs don’t change frequently, so they’re a good candidate for in-memory caching. Files’ and folders’ internal data can also be cached with a short “Time To Live” to improve response times for high-frequency requests.
These are small changes, but they have yielded roughly a 5x performance improvement for these functions.
These cache layers all store their data in the process's RAM. Each cache has a hard size limit, with a default value of 8MB. If your server has plenty of memory, we recommend raising this limit via the CELLS_CACHES_HARD_LIMIT environment variable, expressed in MB.
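For example, on a server with memory to spare, the limit can be raised before starting Cells (the 64MB value here is an arbitrary illustration; pick a size that fits your server):

```shell
# Raise the per-cache hard limit from the 8MB default to 64MB.
# The value is expressed in MB, as described above.
export CELLS_CACHES_HARD_LIMIT=64
cells start
```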