Mastermetadata
From flud
Master Metadata
It is increasingly obvious that flud's current scheme for managing the master metadata performs poorly and lacks the reliability and scalability characteristics present in the rest of the system.
Current Scheme
The mastermetadata is a manifest of files that maps each filename to the convergent key used to retrieve that file. It is stored as one large file in the system, alongside other user files.
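As a rough illustration of this layout, the sketch below models the manifest as a single filename-to-key mapping that is serialized as one blob. Using a SHA-256 digest of the file contents as the convergent key, JSON as the serialization, and the names `convergent_key` and `build_manifest` are all assumptions made for the example; they are not taken from flud's code.

```python
import hashlib
import json
from typing import Dict


def convergent_key(content: bytes) -> str:
    """Derive a key from file content alone (assumed here: SHA-256 hex digest)."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map every full filename to the convergent key of its content."""
    return {name: convergent_key(data) for name, data in files.items()}


# Hypothetical user files; two of them have identical content.
files = {
    "/home/alice/notes.txt": b"hello",
    "/home/alice/copy-of-notes.txt": b"hello",
    "/home/alice/todo.txt": b"buy milk",
}
manifest = build_manifest(files)

# The whole manifest is serialized and stored as one blob, so changing any
# single entry means producing and storing a new copy of the entire blob.
blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
```

Because the key depends only on content, the two identical files end up with the same key, while the manifest still needs one entry (with a full path) per filename.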
Problems with Current Scheme
- Nonperformant, because as the list of files grows, modifying the mastermetadata file becomes a bottleneck. Imagine a user with millions of files and a mastermetadata file reaching into the hundreds of MB, who changes a single small (10K?) text file each day: every such change still requires updating the one large mastermetadata file.
- Unreliable, because the mastermetadata contains pointers to all other files. If it is lost, all is lost. This is only true in the catastrophic case where the source of the metadata is also lost, but that is, after all, the case we are trying to protect against.
- Nonscalable for the same reason as nonperformant.
Possible Alternatives
The most attractive alternative at present is to carve the mastermetadata up into directory-level chunks and store each one at a location determined by the path name and the user's ID. A lookup on the root user ID would then return the directory and file names at the root level of the filesystem, and lookups on each of those nodes would return the directory and file names of the corresponding subdirectories.
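One way this carving-up might look is sketched below. The hash construction in `chunk_location` (SHA-256 over the user ID and normalized path), the chunk entry format, and both function names are hypothetical choices for illustration; the text above only specifies that the location is determined by the path name and the user's ID.

```python
import hashlib
import posixpath
from typing import Dict


def chunk_location(user_id: str, dir_path: str) -> str:
    """Location of a directory chunk (assumed: SHA-256 of user ID plus path)."""
    normalized = posixpath.normpath("/" + dir_path.lstrip("/"))
    return hashlib.sha256(f"{user_id}:{normalized}".encode("utf-8")).hexdigest()


def split_into_chunks(manifest: Dict[str, str]) -> Dict[str, Dict[str, dict]]:
    """Split a flat filename -> key manifest into one chunk per directory.

    Each chunk maps a bare entry name (no path prefix) to either
    {"type": "file", "key": ...} or {"type": "dir"}.
    """
    chunks: Dict[str, Dict[str, dict]] = {"/": {}}
    for path, key in manifest.items():
        # Treat every path as rooted at "/" so the ancestor loop terminates.
        path = posixpath.normpath("/" + path.lstrip("/"))
        parent, name = posixpath.split(path)
        chunks.setdefault(parent, {})[name] = {"type": "file", "key": key}
        # Register each ancestor directory in its own parent's chunk.
        while parent != "/":
            grandparent, dirname = posixpath.split(parent)
            chunks.setdefault(grandparent, {})[dirname] = {"type": "dir"}
            parent = grandparent
    return chunks


# Hypothetical flat manifest with placeholder keys.
manifest = {
    "/docs/a.txt": "key-a",
    "/docs/img/b.png": "key-b",
    "/c.txt": "key-c",
}
chunks = split_into_chunks(manifest)
root_entries = chunks["/"]  # names at the root level: "docs" (dir), "c.txt" (file)
```

A lookup at `chunk_location(user_id, "/")` would then return the root chunk, and each directory entry in it names the next chunk to look up.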
Advantages
- Finer-grained control over chunks of the metadata. It also allows us to remove redundant pathname information from filename entries in the mastermetadata itself, reducing the overall size of these structures.
- More performant and scalable backup operations, since modifications only affect the chunks involved instead of the entire list of all files.
- Less performance degradation as the mastermetadata grows (a definite problem in the current system).
- Better reliability, since it becomes much less likely that all chunks would disappear at the same time.
- Better scalability, for the above reasons.
- Introduces directories as first-class objects, giving us the ability to store metadata about these efficiently.
- Possibility to introduce other filesystem-like efficiencies (links, etc).
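The performance argument in the list above can be made concrete with a small, synthetic comparison of how many bytes a single file change forces us to rewrite under each scheme. The manifest contents, the JSON serialization, and the function names are illustrative assumptions; the chunk here also ignores subdirectory entries for brevity.

```python
import json
import posixpath
from typing import Dict


def bytes_rewritten_monolithic(manifest: Dict[str, str]) -> int:
    """With one big file, any change rewrites the whole serialized manifest."""
    return len(json.dumps(manifest).encode("utf-8"))


def bytes_rewritten_chunked(manifest: Dict[str, str], changed_path: str) -> int:
    """With per-directory chunks, only the parent directory's chunk is rewritten."""
    parent = posixpath.dirname(changed_path)
    chunk = {
        # Entries keep only the bare name; the path prefix is implied by the chunk.
        posixpath.basename(path): key
        for path, key in manifest.items()
        if posixpath.dirname(path) == parent
    }
    return len(json.dumps(chunk).encode("utf-8"))


# Synthetic manifest: 100 directories of 100 files each (made-up data).
manifest = {
    f"/dir{d:03d}/file{f:03d}.txt": f"key-{d}-{f}"
    for d in range(100)
    for f in range(100)
}
whole = bytes_rewritten_monolithic(manifest)
one_chunk = bytes_rewritten_chunked(manifest, "/dir042/file007.txt")
print(f"monolithic: {whole} bytes, chunked: {one_chunk} bytes")
```

In this toy setup the chunked rewrite is far smaller than the monolithic one, and the gap widens as the number of directories grows, since the chunk size depends only on the size of the affected directory.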
Disadvantages
- If stored in the DHT (either the data itself or pointers to it), the chunks would need frequent updates to keep the data from being purged, requiring the source to effectively 'walk' the directory structure occasionally.
- Slightly less performant restore in the catastrophic recovery scenario, because reconstructing the mastermetadata requires walking the entire structure across many remote nodes. Note that this isn't a problem for backup or non-catastrophic restore, since the owning node will have a full mastermetadata cache locally. Also note that this likely doesn't affect perceived performance, since we can start recovering files from the first few results as we continue to descend through the directory structure.
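The walk described in these two points might be sketched as a lazy breadth-first traversal that yields file entries as soon as each directory chunk arrives, so file recovery can begin before the descent finishes. The chunk entry format, the `fetch_chunk` callback, and the in-memory `remote` dictionary standing in for remote nodes are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Dict, Iterator, Tuple


def walk_chunks(
    fetch_chunk: Callable[[str], Dict[str, dict]], root: str = "/"
) -> Iterator[Tuple[str, str]]:
    """Yield (path, key) pairs breadth-first, fetching one directory chunk at a time."""
    pending = deque([root])
    while pending:
        dir_path = pending.popleft()
        # In a real system this would be one remote lookup per directory.
        for name, entry in sorted(fetch_chunk(dir_path).items()):
            child = dir_path.rstrip("/") + "/" + name
            if entry["type"] == "dir":
                pending.append(child)  # descend into it later
            else:
                yield child, entry["key"]  # usable immediately


# In-memory stand-in for directory chunks held on remote nodes.
remote = {
    "/": {"docs": {"type": "dir"}, "c.txt": {"type": "file", "key": "key-c"}},
    "/docs": {"a.txt": {"type": "file", "key": "key-a"}},
}
walker = walk_chunks(remote.__getitem__)
first = next(walker)  # produced after fetching only the root chunk
rest = list(walker)   # the remainder of the descent
```

Because the generator is lazy, the first file entry is available after a single chunk fetch. A traversal of this shape could presumably also drive the occasional refresh walk needed to keep DHT entries from being purged, with the yield step replaced by a re-store of each chunk.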