A cleaned-up and refreshed list of the Big Data papers and blog posts I have read and taken notes on
Streaming & Spark
Discretized Streams - Discretized Processing of Streaming Data
Spark - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Sparrow - Distributed, Low Latency Scheduling
LinkedIn Ecosystem
The Log: What every software engineer should know about real-time data's unifying abstraction
Kafka: a Distributed Messaging System for Log Processing
Apache Samza - Reliable Stream Processing atop Apache Kafka and Hadoop YARN
Google Ecosystem
Bigtable: A Distributed Storage System for Structured Data
Dremel - Interactive Analysis of Web-Scale Datasets
Chubby - lock service for loosely-coupled distributed systems
Megastore - Providing Scalable, Highly Available Storage for Interactive Services
NoSQL
Consistency Issues
Why Vector Clocks are Easy or Hard?
Indexing Techniques
Data Models
NoSQL Data Modeling Techniques
Systems
Dynamo: Amazon’s Highly Available Key-value Store
Cassandra - A Decentralized Structured Storage System
Hadoop Ecosystem
YARN - Yet Another Resource Negotiator
Data Analysis and Mining
Duplicate Detection and Clustering over Massive Document Collections -- the Locality-Sensitive Hashing Algorithm