Introduction to Cassandra
Why Cassandra
MySQL drives too many random I/Os
File-based solutions require far too many locks
Cassandra vs MySQL with 50GB of data
| MySQL | Cassandra |
| ~300ms write | ~0.12ms write |
| ~350ms read | ~15ms read |
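The CAP theorem for distributed systems (Consistency, Availability, Partition tolerance)
Consistency: data is updated consistently, and every change is synchronized across nodes.
Availability: the system keeps responding with good performance.
Partition tolerance: the system stays reliable when the network splits the nodes into partitions.
Theorem: a distributed system can satisfy at most two of these three properties at once. (Brewer stated it as a conjecture in 2000; Gilbert and Lynch proved it formally in 2002.)
Advice: architects should not spend effort trying to design a perfect distributed system that satisfies all three, and should instead decide which trade-off to make.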
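Cassandra features
- High availability
- Incremental scalability
- Eventual consistency
- Tunable tradeoffs between consistency and latency
- Flexible schema: unlike a relational database, there is no need to design the schema in advance, and columns can be added or removed on the fly.
- Range queries: keys can be queried by range.
- Minimal administration
- No SPOF (single point of failure): the failure of a single node does not interrupt cluster service.
The p2p distribution model, which drives the consistency model, means there is no single point of failure. It is based on the Gossip protocol, a distributed protocol for spreading messages through a p2p system, which gives good scalability and robustness.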
Installation
Download and unpack the release.
Edit conf/storage-conf.xml:
<CommitLogDirectory>/vol/cassandra/commitlog</CommitLogDirectory>
<DataFileDirectories>
<DataFileDirectory>/vol/cassandra/data</DataFileDirectory>
</DataFileDirectories>
<CalloutLocation>/vol/cassandra/callouts</CalloutLocation>
<StagingFileDirectory>/vol/cassandra/staging</StagingFileDirectory>
<Seeds>
<Seed>192.168.0.23</Seed>
<Seed>192.168.0.24</Seed>
</Seeds>
<ReplicationFactor>2</ReplicationFactor>
Start: bin/cassandra
Stop: bin/stop-server
Keyspace
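The largest organizational unit in Cassandra. A keyspace contains a set of column families and is usually named after the application. You can think of it as a schema in Oracle, which contains a set of objects.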
Column family(CF)
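A CF is the collection of data stored under a given key, and each CF is physically stored in its own files. Conceptually, a CF resembles a table in a relational database.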
Key
Data must be accessed by key. Cassandra supports range queries over keys, for example :start => '10050', :finish => '10070'.
Range queries require using an order-preserving partitioner.
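To make the requirement concrete, here is a minimal sketch in plain Java. It is not Cassandra code: a `TreeMap` stands in for a node that keeps row keys in sorted order, and the class and method names are made up for the illustration. With sorted keys, a range such as '10050' to '10070' is one contiguous slice; a partitioner that places rows by a hash of the key would scatter those same rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class KeyRangeSketch {
    // Returns the keys in [start, finish], relying on the map's sorted key order.
    static List<String> range(NavigableMap<String, String> rows, String start, String finish) {
        return new ArrayList<>(rows.subMap(start, true, finish, true).keySet());
    }

    public static void main(String[] args) {
        // Hypothetical row keys; the values are placeholders.
        NavigableMap<String, String> rows = new TreeMap<>();
        for (String key : new String[] {"10040", "10050", "10062", "10070", "10081"}) {
            rows.put(key, "row-" + key);
        }
        System.out.println(range(rows, "10050", "10070")); // [10050, 10062, 10070]
    }
}
```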
Column
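A column is the smallest unit of data in Cassandra. A column name and a value form a pair; in name: "jacky", for example, the column is name and the value is jacky. Every column:value pair also carries a timestamp.
Unlike a relational database, a Cassandra row can hold any number of columns, and different rows can hold different columns. From a database-design point of view, you can picture a table with two fields: the first is the key, and the second is a long text field that stores many columns. This is why Cassandra is said to have a very flexible schema.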
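The row model can be pictured with a small Java sketch. This is only a mental model built from in-memory maps, not how Cassandra stores data; the class name, the method names and the rule that a newer timestamp replaces an older one are assumptions made for the illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class RowModelSketch {
    // A column is a (name, value, timestamp) triple.
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // One column family: row key -> (column name -> column), column names kept sorted.
    private final Map<String, TreeMap<String, Column>> rows = new TreeMap<>();

    // Inserts or overwrites a column; a write with an older timestamp is ignored.
    void insert(String key, String name, String value, long timestamp) {
        TreeMap<String, Column> row = rows.computeIfAbsent(key, k -> new TreeMap<>());
        Column existing = row.get(name);
        if (existing == null || timestamp >= existing.timestamp) {
            row.put(name, new Column(name, value, timestamp));
        }
    }

    // Returns the column's value, or null if the row or column does not exist.
    String get(String key, String name) {
        TreeMap<String, Column> row = rows.get(key);
        Column column = (row == null) ? null : row.get(name);
        return (column == null) ? null : column.value;
    }

    int columnCount(String key) {
        TreeMap<String, Column> row = rows.get(key);
        return (row == null) ? 0 : row.size();
    }

    public static void main(String[] args) {
        RowModelSketch cf = new RowModelSketch();
        cf.insert("user1", "name", "jacky", 1L);
        cf.insert("user1", "city", "Beijing", 1L);
        cf.insert("user2", "name", "hellodba", 1L); // this row has no "city" column
        System.out.println(cf.columnCount("user1") + " vs " + cf.columnCount("user2")); // 2 vs 1
    }
}
```

No column is declared anywhere before it is written, which is the "flexible schema" point in miniature.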
Super column
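A super column is a special kind of column that can hold any number of ordinary columns. A CF can likewise hold any number of super columns, but a CF must be defined to use either columns or super columns; the two cannot be mixed. Below is an example: the super column homeAddress has three columns, street, city and zip: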
{ // this is a SuperColumn
    name: "homeAddress",
    // with an infinite list of Columns
    value: {
        // note the keys is the name of the Column
        street: {name: "street", value: "1234 x street", timestamp: 123456789},
        city: {name: "city", value: "Beijing", timestamp: 123456789},
        zip: {name: "zip", value: "100001", timestamp: 123456789},
    }
}
Columns and SuperColumns are both tuples with a name and a value. The key difference is that a standard Column's value is a "string", while a SuperColumn's value is a map of Columns; their values hold different types of data. Another, minor difference is that SuperColumns don't have a timestamp component.
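In Java terms, a super column family adds one more level of map. The sketch below is an in-memory illustration only, with made-up class and method names: a row key leads to a super column name, which leads to a map of ordinary columns. Only the inner columns carry a timestamp.

```java
import java.util.Map;
import java.util.TreeMap;

final class SuperColumnSketch {
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // row key -> super column name -> column name -> column
    private final Map<String, Map<String, Map<String, Column>>> rows = new TreeMap<>();

    void insert(String key, String superColumn, String name, String value, long timestamp) {
        rows.computeIfAbsent(key, k -> new TreeMap<>())
            .computeIfAbsent(superColumn, s -> new TreeMap<>())
            .put(name, new Column(name, value, timestamp));
    }

    // Returns the value at (key, super column, column), or null if any level is missing.
    String get(String key, String superColumn, String name) {
        Map<String, Map<String, Column>> row = rows.get(key);
        if (row == null) return null;
        Map<String, Column> columns = row.get(superColumn);
        if (columns == null) return null;
        Column column = columns.get(name);
        return (column == null) ? null : column.value;
    }

    public static void main(String[] args) {
        SuperColumnSketch cf = new SuperColumnSketch();
        cf.insert("jacky", "homeAddress", "street", "1234 x street", 123456789L);
        cf.insert("jacky", "homeAddress", "city", "Beijing", 123456789L);
        cf.insert("jacky", "homeAddress", "zip", "100001", 123456789L);
        System.out.println(cf.get("jacky", "homeAddress", "city")); // Beijing
    }
}
```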
Sorting
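Unlike a relational database, where ORDER BY defines the sort order at query time, Cassandra always returns data in a fixed order: the data is stored according to the configured rule when it is written, so the order on read is already determined. This is a big performance advantage. Interestingly, Cassandra sorts by column name rather than by column value. It offers the following options for how column names are compared: BytesType, UTF8Type, LexicalUUIDType, TimeUUIDType, AsciiType and LongType. In effect, the column name is interpreted as a different type in each case, which makes the sort order flexible. UTF8Type treats the column name as a UTF-8 string, LongType treats it as a 64-bit long, and TimeUUIDType sorts by time-based UUID. For example: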
Column names sorted as LongType:
{name: 3, value: "jacky"},
{name: 123, value: "hellodba"},
{name: 976, value: "Cassandra"},
{name: 832416, value: "bigtable"}
Column names sorted as UTF8Type:
{name: 123, value: "hellodba"},
{name: 3, value: "jacky"},
{name: 832416, value: "bigtable"},
{name: 976, value: "Cassandra"}
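The two orderings are easy to reproduce with ordinary Java comparators. This sketch only imitates the effect of LongType and UTF8Type on the four column names above; it does not use Cassandra's comparator classes, and plain `String` comparison is used as a stand-in for UTF-8 comparison (for ASCII digits the order is the same).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ColumnNameOrderSketch {
    // LongType-like order: interpret each column name as a 64-bit integer.
    static final Comparator<String> LONG_ORDER = Comparator.comparingLong(Long::parseLong);

    // UTF8Type-like order: compare the names as text, character by character.
    static final Comparator<String> UTF8_ORDER = Comparator.naturalOrder();

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sorted(List<String> names, Comparator<String> order) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("976", "3", "832416", "123");
        System.out.println(sorted(names, LONG_ORDER)); // [3, 123, 976, 832416]
        System.out.println(sorted(names, UTF8_ORDER)); // [123, 3, 832416, 976]
    }
}
```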
Now let's look at a schema for Twitter:
<Keyspace>
    <ColumnFamily CompareWith="UTF8Type" />
    <ColumnFamily CompareWith="UTF8Type" />
    <ColumnFamily CompareWith="UTF8Type"
        CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
    <ColumnFamily CompareWith="UTF8Type" />
    <ColumnFamily CompareWith="UTF8Type"
        CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
</Keyspace>
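We see a keyspace named Twitter containing several CFs. StatusRelationships and UserRelationships are defined as CFs that hold super columns. CompareWith sets the sort rule for columns and CompareSubcolumnsWith sets the sort rule for subcolumns; two rules are used here, TimeUUIDType and UTF8Type. Nothing in the schema defines individual columns, which means the columns can change freely.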
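Cassandra's storage mechanism
Cassandra's storage mechanism deserves a mention. It also borrows from Bigtable's design and uses Memtables and SSTables. Like a relational database, Cassandra records a log entry before writing data; this log is called the commitlog. Only then is the data written to the Memtable of the corresponding column family, where the contents are kept sorted by key. A Memtable is an in-memory structure that is flushed to disk in bulk once certain conditions are met and stored as an SSTable. This works like a write-back cache: it turns random write I/O into sequential write I/O and relieves the storage system of the pressure of heavy write loads. Once an SSTable has been written it is immutable and can only be read; the next Memtable flush goes to a new SSTable file. For Cassandra, then, you can consider disk writes to be sequential only, with no random writes.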
Memtables are flushed to disk when:
Out of space
Too many keys (128 is default)
Time duration (client provided – no cluster clock)
When a commit log has had all its column families pushed to disk, it is deleted
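The write path described above can be sketched in a few lines of Java. Everything here is a simplification and an assumption for illustration: there is a single column family, in-memory lists stand in for the commit log and SSTable files, and the only flush condition modeled is a hypothetical `maxKeys` threshold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();                    // append-only log
    private TreeMap<String, String> memtable = new TreeMap<>();                  // in memory, sorted by key
    private final List<SortedMap<String, String>> sstables = new ArrayList<>();  // immutable once written
    private final int maxKeys;                                                   // hypothetical flush threshold

    WritePathSketch(int maxKeys) {
        this.maxKeys = maxKeys;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. record the write in the log first
        memtable.put(key, value);         // 2. then apply it to the memtable
        if (memtable.size() >= maxKeys) {
            flush();                      // 3. flush once the memtable holds too many keys
        }
    }

    // Stores the memtable as a new read-only, key-sorted SSTable and starts a fresh memtable.
    void flush() {
        if (memtable.isEmpty()) return;
        sstables.add(Collections.unmodifiableSortedMap(memtable));
        memtable = new TreeMap<>();
        commitLog.clear(); // the logged data is now in an SSTable, so the log can be dropped
    }

    // Reads the newest value: the memtable first, then SSTables from newest to oldest.
    String read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) return value;
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        WritePathSketch store = new WritePathSketch(2);
        store.write("a", "1");
        store.write("b", "2"); // second key triggers a flush
        store.write("a", "3"); // newer value lives in the memtable
        System.out.println(store.sstableCount() + " " + store.read("a") + " " + store.read("b")); // 1 3 2
    }
}
```

Existing SSTables are never modified: an update to "a" lands in the memtable and later in a new SSTable, which is why reads may have to consult several tables.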
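Because SSTables cannot be updated, the data of one column family can end up spread over several SSTables. A query then has to read and merge all of the column family's SSTables together with its Memtable, so query performance can degrade badly once a column family has many SSTables. A mechanism is therefore needed to find quickly which SSTables might hold the queried key, without reading and merging all of them. Cassandra uses a Bloom filter for this: several hash functions map each key to positions in a bitmap, which lets Cassandra rule out the SSTables that definitely do not contain the key. (A Bloom filter can give false positives but never false negatives.)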
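A Bloom filter fits in a short Java class. The sketch below shows the general technique only; the bitmap size, the number of hashes and the way the hash positions are derived are arbitrary choices for the illustration and are not taken from Cassandra's implementation.

```java
import java.util.BitSet;

final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;      // number of bits in the bitmap
    private final int hashCount; // number of hash positions per key

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th bit position for a key, derived from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) * 31 + 17; // illustrative second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added; true: the key may have been added.
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("10050");
        System.out.println(filter.mightContain("10050")); // true: added keys are always reported
        System.out.println(filter.mightContain("10070")); // false unless its bits happen to collide
    }
}
```

With one such filter per SSTable, a read only has to open the SSTables whose filter answers "maybe".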
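To limit the performance impact of having many SSTables, Cassandra also periodically merges several SSTables into a new one. Because the keys in each SSTable are already sorted, a single merge sort pass is enough, so the cost is acceptable. The merge does the following: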
Merge keys
Combine columns
Discard tombstones
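Those three steps can be sketched as follows. This is a simplified model with made-up names: each SSTable is a sorted map from key to a single cell, a `null` value marks a tombstone, and the newest timestamp wins. For brevity the merge collects entries into a `TreeMap` instead of streaming a merge sort over the files, and it drops every tombstone immediately.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class CompactionSketch {
    static final class Cell {
        final String value;   // null marks a tombstone (a deleted entry)
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    // Merges several key-sorted tables into one new key-sorted table.
    static SortedMap<String, Cell> compact(List<? extends SortedMap<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> entry : table.entrySet()) {
                Cell current = merged.get(entry.getKey());
                // Merge keys: keep only the cell with the newest timestamp for each key.
                if (current == null || entry.getValue().timestamp > current.timestamp) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        // Discard tombstones: deleted entries are not carried into the new table.
        merged.values().removeIf(Cell::isTombstone);
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, Cell> older = new TreeMap<>();
        older.put("a", new Cell("1", 1L));
        older.put("b", new Cell("2", 1L));
        TreeMap<String, Cell> newer = new TreeMap<>();
        newer.put("a", new Cell("9", 2L));  // overwrites a
        newer.put("b", new Cell(null, 2L)); // deletes b
        newer.put("c", new Cell("3", 2L));
        System.out.println(compact(Arrays.asList(older, newer)).keySet()); // [a, c]
    }
}
```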
So the Cassandra data directory contains three kinds of files, named like this:
ColumnFamilyName-SequenceNumber-Data.db (key/value string pairs, sorted by keys)
ColumnFamilyName-SequenceNumber-Filter.db (all keys in data file)
ColumnFamilyName-SequenceNumber-index.db ((key, offset) pairs that point into the data file)
Data.db is the SSTable data file; SSTable is short for Sorted Strings Table, and it stores key/value strings sorted by key. index.db is the index file, which records the offset of each key in the data file, and Filter.db is the mapping file produced by the Bloom filter.
ClientDemo
| Relational | SELECT `column` FROM `database`.`table` WHERE `id` = key; |
| BigTable | table.get(key, "column_family:column") |
| Cassandra: standard model | keyspace.get("column_family", key, "column") |
| Cassandra: super column model | keyspace.get("column_family", key, "super_column", "column") |
// Open a Thrift connection; framed transport is used only when -Dcassandra.framed=true is set
TSocket socket = new TSocket("192.168.0.23", 9160);
TTransport trans;
if (Boolean.valueOf(System.getProperty("cassandra.framed", "false")))
    trans = new TFramedTransport(socket);
else
    trans = socket;
trans.open();
TProtocol protocol = new TBinaryProtocol(trans);
Cassandra.Client client = new Cassandra.Client(protocol);
// keySpace, columnFamily, key, column, columns and value are supplied by the caller;
// toBytes/fromBytes are the application's own serialization helpers (not shown)
ColumnPath columnPath = new ColumnPath(columnFamily, null, column.getBytes("UTF-8"));
// Insert or update
client.insert(keySpace, key, columnPath, toBytes(value),
        System.currentTimeMillis(), ConsistencyLevel.ONE);
// Delete
client.remove(keySpace, key, columnPath, System.currentTimeMillis(), ConsistencyLevel.ONE);
// Read: either the named columns, or up to the first 100 columns of the row
ColumnParent columnParent = new ColumnParent(columnFamily, null);
SlicePredicate slicePredicate = new SlicePredicate();
if (columns != null) {
    List<byte[]> list = new ArrayList<byte[]>();
    for (String name : columns)
        list.add(name.getBytes("UTF-8"));
    slicePredicate.setColumn_names(list);
} else {
    slicePredicate.setSlice_range(new SliceRange(new byte[] {}, new byte[] {}, false, 100));
}
List<ColumnOrSuperColumn> cols = client.get_slice(keySpace, key,
        columnParent, slicePredicate, ConsistencyLevel.ONE);
for (ColumnOrSuperColumn col : cols) {
    String name = new String(col.column.name, "UTF-8");
    Object result = fromBytes(col.column.value);
}
ConsistencyLevel
Write:
ZERO Ensure nothing. A write happens asynchronously in the background.
ONE Ensure that the write has been written to at least 1 node's commit log and memory table before responding to the client.
QUORUM Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes before responding to the client.
ALL Ensure that the write is written to <ReplicationFactor> nodes before responding to the client.
Read:
ZERO Not supported, because it doesn't make sense.
ONE Will return the record returned by the first node to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called 'read repair'.)
QUORUM Will query all storage nodes and return the record with the most recent timestamp once at least a majority of replicas have reported. Again, the remaining replicas will be checked in the background.
ALL Not yet supported, but we plan to eventually.
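The QUORUM arithmetic is worth working through once. The sketch below computes `<ReplicationFactor> / 2 + 1` with integer division and checks the usual overlap rule, R + W > ReplicationFactor, under which every read touches at least one replica that saw the latest write. The overlap rule is general quorum reasoning rather than something stated in the level descriptions above, and the class and method names are made up for the example.

```java
final class QuorumSketch {
    // Nodes that must respond at QUORUM: ReplicationFactor / 2 + 1 (integer division).
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when any set of readNodes replicas must share a replica with any set of writeNodes replicas.
    static boolean overlaps(int readNodes, int writeNodes, int replicationFactor) {
        return readNodes + writeNodes > replicationFactor;
    }

    public static void main(String[] args) {
        System.out.println(quorum(2));         // 2: with ReplicationFactor 2, QUORUM means both replicas
        System.out.println(quorum(3));         // 2
        System.out.println(overlaps(1, 1, 2)); // false: ONE/ONE reads may miss the latest write
        System.out.println(overlaps(quorum(3), quorum(3), 3)); // true: QUORUM/QUORUM always overlaps
    }
}
```

With the ReplicationFactor of 2 from the sample configuration, QUORUM and ALL wait for the same number of nodes, so QUORUM only becomes a distinct trade-off from a replication factor of 3 upward.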
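Read/write performance tests
1. Reads and writes at ConsistencyLevel.ONE.
200,000 random reads took 152,984 ms (0.765 ms average) and 200,000 sequential reads took 130,674 ms (0.653 ms average); a single-record read took 19 ms. The write test with 2,000,000 records took 980,884 ms (0.490 ms average) / 961,650 ms (0.481 ms average).
2. Reads at ConsistencyLevel.QUORUM.
200,000 random reads took 565,632 ms (2.828 ms average) and 200,000 sequential reads took 471,483 ms (2.35 ms average); a single-record read took 22 ms.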
Applications
1. Storing and managing small files.
2. Lucandra: Lucene + Cassandra
Lucandra is a Cassandra backend for Lucene. Cassandra allows us to pull ranges of keys and groups of columns so we can really tune the performance of reads as well as minimize network IO for each query. Also, since writes are indexed and replicated by Cassandra we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.
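3. Replacing a relational database.
Cassandra's defining feature, and its biggest advantage, is scalability. As I see it, scalability here covers two things. First, it can store very large amounts of data: with its distributed architecture, adding machines adds storage capacity. Second, it can serve a large number of concurrent queries.
Its drawback is the lack of transactions: relationships between data and data integrity have to be maintained by the application.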