I went to a really interesting talk at Foo Cafe in Stockholm tonight. I stayed almost 2 hours over my time limit, but it was so worth it. The talk was by Martin Neumann, and he gave us an introduction to big data and some tools commonly used to wrangle it. Below are the notes I took throughot the talk. I cleaned them up a tiny bit, but mostly these are just for my personal benefit and are super messy - if you didn’t attend the talk they may not be that useful.
Martin gave a couple of really great real world examples of companies he’s worked with on various big data projects. Though he had permission to use the examples in the talk, I’m not really sure if he’d mind my posting specific company names here, so I’ve decided to just refer to the companies in very general category terms. If I get the OK to replace this with actual names later I’ll do that.
- 95% of the data we have now was produced in the last 2 years
- Content Streaming Company X: ~40TB of log data every day
- 85% of data stored today is unstructured
WHAT IS BIG DATA
The “Five Vs”.
- Tables, files
Volume challenges: Storage; Redundancy/backups; discoverability/filtering; processing/analysis
Traditional data: database, structured fields, rules, etc. Big data: more of a dump, Unstructured - videos, text, random urls, etc
Variety Challenges: Data is often quite noisy, needs cleaning; no context in data; disambiguation - is it about Paris the city or Paris the person?
Content Streaming Company X example: they get meta info about content from the labels; people manually add labels OR they have a db connection. Which “Kent” is this new piece of content from? There are 7-8 creators with name Kent. no global identifier for creators.
- Origin, reputation
Challenges: data quality is quite low; errors, missing data, unexpected structure, etc.
Content Streaming Company X eg: metadata from labels may be just wrong; some creators put in misleading data intentionally to broaden viewership.
Data and knowledge is not the same thing. What are these 5 million tweets about?
Challenges: data is too big to just look at/visualize. Traditional analytics tools often do not work at large scales
Data comes at a very fast rate. NY Stock exchange produces ~1TB of trade info per day. Not that big. But you need to process it in milliseconds. Storage on single machine would be way too slow.
Multiple machines because:
- data does not fit on a single machine
- single machine cannot process data in time
- data too complex to analyse on single machine
With distributed and big data systems, failure is normal/the norm. Something fails all the time.
CAP Theorem: there can only be two: Consistency, Availability, Partition Tolerance
- Availability: server/data is reachable. Request always gets timely response.
- Partition Tolerance: handling sudden disconnects; user does not see that the system may be down or partitioned
- Consistency: if you have written something into the system or db, either everyone sees it or nobody does. (an issue I see quite often with Twitter - I guess they picked Availability and Partition Tolerance!)
There are some levels, but in general pick 2 of the above.
- Consistency & Partition Tolerance - stop and wait, sacrifice Availability
- Consistency & Availability - not often used, can’t really live with 2 independent subsystems
- Availability & Partition Tolerance - most popular in distributed systems field.
Bank transactions/systems are not usually done in big data stacks; there is usually a single server that runs that stuff somewhere.
- Some tricks exist to work with the above, like never modifying data.
- Estimated Google data centers have disk failure every 15 minutes (not sure how accurate?)
Data Analysis Framework
- Host Machine
- Data Source -> storage -> Analytics/processing
- Storage -> visualization
Big Data systems based on above:
- open source server side data processing pipeline
- gather data, package it, send to storage solution
- Distributed file system
- Open source implementation of googl
- Has Coordinator node
- Big file (eg 40 TB) -> Storage Nodes (one file split between nodes in fixed size pieces)
- Coordinator gives space, file is written in chunks, gets placed on different nodes (and replicated to different servers as backups)
To Read: Ask coordinator for file, it reconstructs file from the storage nodes. Chunks can be retrieved in parallel
Speed more dependent on network vs machine hard drive. Fast reads. This system is industry standard in big data storage. Usually installed on cloud servers/VMs. Storage nodes need to communicate with each other.
KAFKA message broker; Low latency storage
Used for streaming data
- Stream -> Broker nodes
- Lots of small events come in, eg through network port. You store them on broker nodes/multiple machines.
- Machines/nodes replicate the data from one to another. One master node/coordinator figures out who is alive etc.
- You/client ask coordinator for location of data when system starts, initially, then you read directly from the broker nodes after you are up and running.
Unlike batch storage, this is a message broker. Storage is there “forever”. MB has a limited lifespan for its messages/events. If you need a data long term you can write it from message broker to HDFS.
- HDFS and Kaska both have immutable data.
- Neo4j is world leading graph db, focusing on storing graph shape data
- Content Streaming Company X use Kafka and HDFS
- Apache Spark
- Apache Flink
- Hadoop Map Reduce
- Grandfather of the above, a little outdated.
- Not to be confused with Hadoop HDFS.
- Storage Hadoop is not the same as Compute Hadoop.
- If you want to use Map Reduce you HAVE to use HDFS. But you can use HDFS without Map Reduce (commonly used this way).
- Write map functions to apply to each block in storage; write reduce functions to smash the data together.
- For count: count elements in each block separately, then send to central reduce function to sum them up.
- Very low level and complicated for ML purposes for example.
Working on functional program against a data stream.
- Spark and Flink are closer to db world, can do stuff like joins;
- Unlike HMR, both Spark and Flink can work with stream storage. Flink better for streaming, Spark is better for batch/big files.
- Spark vs Flink:
- Spark better for machine learning;
- There are two because they evolved at the same time,
- fundamentally work in a different way; some things flink does better/differently in terms of stream processing.
- stream processor is basically where you can say the stream never ends. fundamental difference is that spark is a batch processor that can do a bit of streaming, flink is a stream processor that can do batch. middle of the road usages are pretty much the same.
- complicated streaming: go for flink, complicated batch and/or ML: go for spark,
- Content Streaming Company X started out using MapReduce and HDFS, but now have moved onto faster tech for computation but still use HDFS
- Streaming vs batch:
- For fraud detection/real time processing you want streaming. For later on analysis that is not time critical you can do in batching.
- Most things you can do in batch you can also do in stream, except machine learning. Most machine learning needs to go over the data multiple times repeatedly. This can mostly only be done in batch processing, though there are a few streaming projects in their infacny developing.
Kibana + ElasticSearch:
- Record -> Data Nodes
- ES: storage solution, Kibana: visualization solution
- Records can be structured or unstructured (txt files)
- ES: Nice REST interface, can do everything from command line
- DO NOT USE as traditional database data storage. If you have 2 records coming in at the same time for the same data point, one will randomly win - the other will not show up. Does not guarantee ordering or consistency.
- Data is mutable, but you have to be aware of the ordering issue above.
- Typically used for log files
- Client sends request to the master with a query
- Master sends query to all data nodes, data nodes look at local data, evaluate query, send partial results up to master which will join results back together and give them back to client. Load is always generated on every node.
- fast to get results
- Limitation: Only a certain amount of data points you can look at. Need to be sure to ask the right kind of question/make the right query.
- Used by Big Tech Company Y & Content Streaming Company X to do stuff like monitoring services;
- Deploys VMs, can be run on your own hardware or cloud servers like AWS
- Your own data centers
Hops (hadoop for humans)
- Platform as a service “done right”.
- Startup, very new.
- cloud platform.
- Runs on a data center and gives access to the entire stack of aforementioned tools.
- create project, provide data source, tell it if you want to use spark + kibana + ES etc. It will connect the tools, secure them, etc.
- Originated as a research project from Stockholm
- Similar to openstack but does not do VMs, instead it creates containers (smaller and less overhead than VM).
- More lightweight than openstack
- faster, smaller, more agile than openstack, can run on single HW machine.
- Containers can run in parallel on one machine.
- From Google.
- Powerful, but very complex setup and maintenance process;
- don’t need to deal with hardware but still have all the maintenance
- Same challenges as AWS
Your own hardware
- As with AWS and Azure, very complex setup and maintenance process;
- Openstack, kubernetes, hops kind of get rid of the configuration load. All open sourcs.
- Big Tech Company Y is thinking about moving away from their own hardware to cloud servers
- For Content Streaming Company X everything is stored in hdfs and everyone who works in hdfs has access to everything.
Example 1: Big Tech Company Y Research (not in productoin, partial success)
- Data sources, cluster with lots of machines producing log files -> logstash gathers the files -> files pushed to kafka -> consuming data in flink and doing anomaly detection
- Also from Kafka -> elastic search -> kibana
- Running on openstack - yarn
You don’t want to find one random error message, but the sweet spot of likely bad cases that you can still do something about before everything explodes. System was taking error messages and predicting if it was still OK or if something is so abnormal that action needs to be taken.
Challenge: Not enough time to build it, the algorithm was complicated. Algorithm IS in use in different products, eg anomaly detection for ship movements.
Example 2: Car industry, research prototype, not deployed yet
- Data was gathered from car hazard lights, sent to manufacturer.
- car data -> kafka -> kafka to Enrichment flink, flink back to kafka, kafka to Analytics Flink, back to Kafka, and Kafka to Kibana/ES
- open street map -> enrichment flink
- Two flink processes (Enrichment + Analytics). Enrichment process tries to see if you are in city or middle of the woods, for example. Enrichment results go back to kafka.
- kafka, flink processes, elasticsearch managed by hops.
Analyze hazard warning: someone parking in the wrong lane vs traffic jam? what warning lights are more relevant to others?
Challenge: Was no big data in the system - 100 events or so. Had to simulate large amounts of events to show infrastructure capability
Example 3: Big Tech Company Y production project
- Hardware (like a car or other device) -> Mediation -> Kafka -> Kafka Streams;
- From Kafka also to Kibana + ES
- Running on Kubernetes
- Deals less with message content vs management of infrastructure
- 7-8k messages per second, load tested for 4000 per second; already running over capacity. peak times can get up to 20k messages per second
Big Tech Company Y switch from openstack to kubernetes because:
- Big Tech Company Y Research is different department as Big Tech Company Y production. Things vary between departments
- openstack was internal cluster for research, different requirements.
- growing fast is a lot less painful with kubernetes. (openstack was using local hardware, resource requirements were predictable)
Container solutions like kubernetes will likely replace VM solutions like openstack. kubernetes is much smaller and faster, spinning up a VM takes a long time (~15 mins). kubernetes can spin up a new container in ~2 seconds. If something fails you can spin something back up to replace it before anyone notices.
What are some challenges you are having?
- For the specific Big Tech Company Y production project, people are not aware what this kind of tech can do.
- difficult to talk to a typical SE and figure out what they want/need.
- field is very complicated; different techs with pros, cons, etc. need to be careful to make right decision early on, which is difficult if the person setting the requirements does not know fundamentals.
- Basics like CAP theory should be part of every CS curriculum.
- People know the traditional DB world, but do not yet understand big data processing/ML. It will not work the same way as an SQL db would, and people should not expect it to. Without basic concepts people cannot begin to imagine what this stuff can be used for.
- Tools are changing quickly, the basic concepts are not. Eg CAP theory will stay.