These are my live notes from the session Ross Turk of Inktank presented today at Cloud Connect Chicago.
Example of how things scale bigger and bigger – hotels (e.g., hotels in Vegas). What if a hotel had a million rooms? It's always under construction, self-maintaining.
At some point scaling becomes impractical.
Cool yet crazy elevator example to explain the Ceph project.
As Ceph was being designed, these were the considerations:
- Philosophy – open source, community-focused
- Design – scalable, no single point of failure, software based (not appliance based), self-managing
Architecture:
- Object store (RADOS) is the base
- OSDs run on top of a filesystem (btrfs, xfs); 1 per disk (recommended), 3 required to make a cluster. They serve objects to clients and intelligently peer with each other to perform replication tasks.
- Monitors – members of the cluster that maintain the cluster map – they don't serve data themselves
- Librados – library that lets applications talk directly to the object store (no HTTP overhead); see the minimal sketch after this list
- RADOSGW – REST gateway compatible with S3 and Swift. Southbound = native RADOS, northbound = REST. Supports buckets and accounting; see the S3 example after the list
- RBD – RADOS block device. Assembles objects from the cluster into an image that can be mounted as a block device or used as a virtual machine disk. Because the image lives in the cluster, there are interesting ways to live-migrate VMs (sketch after the list).
- CephFS – POSIX-compliant distributed filesystem. Metadata needs to be managed, so there is a metadata server – but only if you are running CephFS.
- Metadata servers are clustered; they don't share data
- CRUSH – the algorithm that determines where data is stored and retrieved (toy sketch at the end of these notes)
- hashes the object name and maps it into a placement group
- the placement group gets passed to CRUSH along with the cluster map and the rule set
- all placement work happens on the client – no central lookup
- placement group assignment is pseudo-random
- Configuration is rule-based
- If a node is lost, Ceph recalculates CRUSH placement, figures out where the replicas should now live, and moves the data there, so it's ready for the client when it calculates the new location
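
A minimal librados sketch (my addition, not from the talk) to make the "no HTTP overhead" point concrete – it assumes the python-rados bindings are installed, a ceph.conf at the usual path, and a hypothetical pool named 'data':

```python
import rados

# Connect to the cluster using the local configuration file
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context for a pool ('data' is an assumed pool name)
ioctx = cluster.open_ioctx('data')
try:
    # Write and read an object directly – no HTTP gateway in the path
    ioctx.write_full('hello-object', b'Hello from librados')
    print(ioctx.read('hello-object'))
finally:
    ioctx.close()
    cluster.shutdown()
```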
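
To show the S3 compatibility of RADOSGW, a quick boto3 sketch (again my addition); the endpoint, keys, and bucket name are placeholders you would get from your own gateway setup (e.g. radosgw-admin):

```python
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://radosgw.example.com:7480',  # assumption: your gateway address
    aws_access_key_id='ACCESS_KEY',                  # placeholder credentials
    aws_secret_access_key='SECRET_KEY',
)

# Buckets and objects behave as they would against S3 itself
s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'Hello via the REST gateway')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())
```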
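
For RBD, a sketch of creating an image with the python-rbd bindings (my addition); the pool name 'rbd' and the image name are assumptions:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
try:
    # Create a 1 GiB image; its blocks are striped across the whole cluster
    rbd.RBD().create(ioctx, 'vm-disk-0', 1024 ** 3)

    # Open it and write at an offset, as a hypervisor would for a VM disk
    image = rbd.Image(ioctx, 'vm-disk-0')
    try:
        image.write(b'boot sector bytes', 0)
    finally:
        image.close()
finally:
    ioctx.close()
    cluster.shutdown()
```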
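
And finally a toy sketch of the CRUSH idea (mine – not the real algorithm, which walks the cluster map hierarchy): any client can compute an object's location from the object name, the PG count, and the OSD list alone, and recompute it the same way when a node disappears.

```python
import hashlib

def _hash(*parts):
    # Stable hash so every client computes the same answer
    digest = hashlib.md5('.'.join(str(p) for p in parts).encode()).hexdigest()
    return int(digest, 16)

def placement(obj_name, pg_num, osds, replicas=3):
    # 1. Hash the object name into a placement group
    pg = _hash(obj_name) % pg_num
    # 2. Pseudo-randomly rank the OSDs for this PG (highest-random-weight
    #    style); real CRUSH also applies the rule set and cluster map
    ranked = sorted(osds, key=lambda osd: _hash(pg, osd), reverse=True)
    return pg, ranked[:replicas]

# Any client computes the same placement with no central lookup
pg, where = placement('hello-object', pg_num=128, osds=list(range(10)))
print('pg %d -> osds %s' % (pg, where))

# If osd 7 is lost, every client recomputes and agrees on where the
# replicas should now live (cf. the recovery bullet above)
pg, where = placement('hello-object', pg_num=128, osds=[o for o in range(10) if o != 7])
print('pg %d -> osds %s' % (pg, where))
```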