Here are my (mostly) unedited notes from the session "Understanding big data" by Bradford Stephens and Jeremy Edberg.
- What is Big Data?
- When you have so much data on one machine that you can no longer get to it in a reasonable time
- Reddit knew they had a big data problem when it took over an hour to process an hour's worth of traffic
- Problems developers will see:
- SQL stops working
- Performance decays exponentially
- Problems Ops will see:
- SPOFs/Cascading failures
- Constantly hiring more ops staff, who still can't keep up with installations, deployments, failures, etc.
- Problems Biz will see:
- Product failures – customers leave
- Exponential hardware cost
- Big data solutions are similar to moving to the cloud
- For dev:
- For Ops:
- Automate ops with development (e.g. Puppet to auto-configure servers)
- Build for seamless future
- For the Biz:
- Hire for scale from day 1 (hire the best engineers: you are building systems, so you need engineers, not hackers or script kiddies)
- Clusters of commodity hardware (no need for huge up-front expenditures)
- Big Data in action – Reddit
- First tried to shard data across databases
- Still couldn't read data out fast enough
- Eventually consistent reads were not good enough
- Big Data in action – Drawn to Scale
- A single node can't keep up
- Moved to a sharded solution. Advantage: data is stored across multiple computers. Disadvantage: you can't run simple SQL queries across distributed nodes, and you can't use distributed indexes
- Used hTable model to make indexes distributed
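
To make the sharding trade-off in the notes above concrete, here is a minimal sketch, not anything shown in the session. It assumes hash-based sharding over in-memory dictionaries; the names `ShardedStore`, `shard_for`, `count_by` and the shard count are hypothetical. A lookup by key touches one shard, while an aggregate that would be a single SQL `GROUP BY` on one node has to visit every shard and merge the results.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical cluster size, chosen only for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy stand-in for a set of database shards (assumption: one dict per shard)."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, row: dict) -> None:
        # A write goes to exactly one shard.
        self.shards[shard_for(key, self.num_shards)][key] = row

    def get(self, key: str):
        # A lookup by key also touches exactly one shard.
        return self.shards[shard_for(key, self.num_shards)].get(key)

    def count_by(self, column: str) -> dict:
        # A cross-shard aggregate has no single place to run:
        # every shard is scanned and the partial counts are merged.
        totals = defaultdict(int)
        for shard in self.shards:
            for row in shard.values():
                totals[row[column]] += 1
        return dict(totals)


store = ShardedStore()
store.put("post:1", {"author": "alice", "score": 10})
store.put("post:2", {"author": "bob", "score": 3})
store.put("post:3", {"author": "alice", "score": 7})

single_row = store.get("post:1")           # one shard consulted
posts_per_author = store.count_by("author")  # all shards consulted
```

In this sketch `count_by` scans every row because no index spans the shards, which is the gap the notes describe a distributed index as filling.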