Rework storage format to index-organized approach

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Rework storage format to index-organized approach

Vladimir Ozerov
Igniters,

I'd like to start a discussion about new storage format for Ignite. Our
current approach is so-called *heap-organized* storage with secondary index
per partition. It has a number of drawbacks:
1) Slow scans (joins, OLAP workload) - data is writen in arbitrary manner,
so iteration over base index leads to multiple page reads and page locks
2) Slow writes in case of OLTP workload- every update touches miltiple
index and free-list pages (a kind of write amplification)
3) Duplicated PK index when SQL is enabled - our base index cannot be used
for lookups or range scans. This makes write amplification effects even
worse.

All mature RDBMS systems emply alternative format as default -
*index-organized* storage. In this case primary index leaf pages is data
pages. Rowse are sorted inside data pages. This gives:
- Blazingly fast scans (no dereference, less page reads, less evictions,
less locks)
- Fast writes in OLTP workloads when PK index column (e.g. ID) grows
monotonically (you need to *update only one page* if there are no splits)
- Slower random writes due to index fragmentation compared to heap

I propose to adopt this approach in two phases:
1) Optionally add data to leaf pages [1]. This should improve our ScanQuery
dramatically
2) Optionally has single primary index instead of per-partition index [2].
This should improve our updates and SQL scans at the cost of harder
rebalance and recovery.

Thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-7026
[2] https://issues.apache.org/jira/browse/IGNITE-7027
Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

dmagda
Vladimir,

How the free lists will be affected by the indexed-organized architecture? From what I see they’re becoming optional.


Denis
 

> On Nov 27, 2017, at 12:46 PM, Vladimir Ozerov <[hidden email]> wrote:
>
> Igniters,
>
> I'd like to start a discussion about new storage format for Ignite. Our
> current approach is so-called *heap-organized* storage with secondary index
> per partition. It has a number of drawbacks:
> 1) Slow scans (joins, OLAP workload) - data is writen in arbitrary manner,
> so iteration over base index leads to multiple page reads and page locks
> 2) Slow writes in case of OLTP workload- every update touches miltiple
> index and free-list pages (a kind of write amplification)
> 3) Duplicated PK index when SQL is enabled - our base index cannot be used
> for lookups or range scans. This makes write amplification effects even
> worse.
>
> All mature RDBMS systems emply alternative format as default -
> *index-organized* storage. In this case primary index leaf pages is data
> pages. Rowse are sorted inside data pages. This gives:
> - Blazingly fast scans (no dereference, less page reads, less evictions,
> less locks)
> - Fast writes in OLTP workloads when PK index column (e.g. ID) grows
> monotonically (you need to *update only one page* if there are no splits)
> - Slower random writes due to index fragmentation compared to heap
>
> I propose to adopt this approach in two phases:
> 1) Optionally add data to leaf pages [1]. This should improve our ScanQuery
> dramatically
> 2) Optionally has single primary index instead of per-partition index [2].
> This should improve our updates and SQL scans at the cost of harder
> rebalance and recovery.
>
> Thoughts?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7026
> [2] https://issues.apache.org/jira/browse/IGNITE-7027

Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

dsetrakyan
In reply to this post by Vladimir Ozerov
Vladimir,

I definitely like the overall direction. My comments are below...


On Mon, Nov 27, 2017 at 12:46 PM, Vladimir Ozerov <[hidden email]>
wrote:

>
> I propose to adopt this approach in two phases:
> 1) Optionally add data to leaf pages. This should improve our ScanQuery
> dramatically
>

 Definitely a good idea. Shouldn't it make the primary lookups faster as
well?

2) Optionally has single primary index instead of per-partition index. This
> should improve our updates and SQL scans at the cost of harder rebalance
> and recovery.
>

Can you explain why it would improve SQL updates and Scan queries?

Also, why would this approach make rebalancing slower? If we keep the index
sorted by partition, then the rebalancing process should be able to grab
any partition at any time. Do you agree?

D.
Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

Vladimir Ozerov
Denis,

No, most likely free lists (or any other space management component) will
stay still. But in case of index-organized storage we will use it in less
number of scenarios.

On Tue, Nov 28, 2017 at 6:27 AM, Dmitriy Setrakyan <[hidden email]>
wrote:

> Vladimir,
>
> I definitely like the overall direction. My comments are below...
>
>
> On Mon, Nov 27, 2017 at 12:46 PM, Vladimir Ozerov <[hidden email]>
> wrote:
>
> >
> > I propose to adopt this approach in two phases:
> > 1) Optionally add data to leaf pages. This should improve our ScanQuery
> > dramatically
> >
>
>  Definitely a good idea. Shouldn't it make the primary lookups faster as
> well?
>
> 2) Optionally has single primary index instead of per-partition index. This
> > should improve our updates and SQL scans at the cost of harder rebalance
> > and recovery.
> >
>
> Can you explain why it would improve SQL updates and Scan queries?
>
> Also, why would this approach make rebalancing slower? If we keep the index
> sorted by partition, then the rebalancing process should be able to grab
> any partition at any time. Do you agree?
>
> D.
>
Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

Vladimir Ozerov
In reply to this post by dsetrakyan
Dima,

1) Primary key lookups could become a bit faster, but no breakthrough is
expected - there will be no need to jump from B+Tree leaf to data page, but
the tree itself will be bigger, because data records will take more space
than index records. I expect parity here.

2) We should observe dramatical improvement for scans (either ScanQuery or
SqlQuery) because data will be stored sequentially within blocks. Consider
the following case - a table with 10 records which could fit to 1 data
page. In current approach (heap) these records could be located in anywhere
from 1 data block to 10 different data blocks - it all depends on update
timings and free lists. So you end up in 10 page lock/unlock cycles and up
to 10 page reads, which will drive our LRU policy mad. In case of
index-organized approach data will be stored in 1 block in the best case
(sequential PK, no fragmentation), or 2-3 blocks in case of page splits or
segmentation. Clearly, this would be a huge win in terms of locks, page
reads and IO for scan workloads.

3) DML will be faster in case of sequential primary keys, e.g. (nearly)
monotonic LONG as transaction identifier. In this case data will be laid
out in a perfect sequential manner withing individual blocks, and in most
cases INSERT will lead to 1 data page update and 1 WAL record. Compare it
to 6 WAL record updates with current approach. On the other hand, random
INSERTS (e.g. UUID key) could become slower due to page splits and
fragmentation. Heap-organized storage is more preferable in this case.

4) Ideally we should not have index-per-partition, because in this case PK
range scans which are typical on OLAP workloads and JOINs will be slow. In
this case it would be not that easy to extract wipe out evicted partition.
This is another trade off - fast operations on stable system at the cost of
slower intermediate processes.

On Tue, Nov 28, 2017 at 6:27 AM, Dmitriy Setrakyan <[hidden email]>
wrote:

> Vladimir,
>
> I definitely like the overall direction. My comments are below...
>
>
> On Mon, Nov 27, 2017 at 12:46 PM, Vladimir Ozerov <[hidden email]>
> wrote:
>
> >
> > I propose to adopt this approach in two phases:
> > 1) Optionally add data to leaf pages. This should improve our ScanQuery
> > dramatically
> >
>
>  Definitely a good idea. Shouldn't it make the primary lookups faster as
> well?
>
> 2) Optionally has single primary index instead of per-partition index. This
> > should improve our updates and SQL scans at the cost of harder rebalance
> > and recovery.
> >
>
> Can you explain why it would improve SQL updates and Scan queries?
>
> Also, why would this approach make rebalancing slower? If we keep the index
> sorted by partition, then the rebalancing process should be able to grab
> any partition at any time. Do you agree?
>
> D.
>
Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

Alexey Kuznetsov
In reply to this post by Vladimir Ozerov
Vova,

If we are going to rework indexes, could we also think about supporting TTL
Indexes (like in Mongo DB [1])?


[1] https://docs.mongodb.com/manual/core/index-ttl/


On Tue, Nov 28, 2017 at 3:46 AM, Vladimir Ozerov <[hidden email]>
wrote:

> Igniters,
>
> I'd like to start a discussion about new storage format for Ignite. Our
> current approach is so-called *heap-organized* storage with secondary index
> per partition. It has a number of drawbacks:
> 1) Slow scans (joins, OLAP workload) - data is writen in arbitrary manner,
> so iteration over base index leads to multiple page reads and page locks
> 2) Slow writes in case of OLTP workload- every update touches miltiple
> index and free-list pages (a kind of write amplification)
> 3) Duplicated PK index when SQL is enabled - our base index cannot be used
> for lookups or range scans. This makes write amplification effects even
> worse.
>
> All mature RDBMS systems emply alternative format as default -
> *index-organized* storage. In this case primary index leaf pages is data
> pages. Rowse are sorted inside data pages. This gives:
> - Blazingly fast scans (no dereference, less page reads, less evictions,
> less locks)
> - Fast writes in OLTP workloads when PK index column (e.g. ID) grows
> monotonically (you need to *update only one page* if there are no splits)
> - Slower random writes due to index fragmentation compared to heap
>
> I propose to adopt this approach in two phases:
> 1) Optionally add data to leaf pages [1]. This should improve our ScanQuery
> dramatically
> 2) Optionally has single primary index instead of per-partition index [2].
> This should improve our updates and SQL scans at the cost of harder
> rebalance and recovery.
>
> Thoughts?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7026
> [2] https://issues.apache.org/jira/browse/IGNITE-7027
>



--
Alexey Kuznetsov
Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

dsetrakyan
On Wed, Nov 29, 2017 at 12:14 AM, Alexey Kuznetsov <[hidden email]>
wrote:

>
> If we are going to rework indexes, could we also think about supporting TTL
> Indexes (like in Mongo DB [1])?
>

Awesome idea. Is there a ticket?
Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

Vladimir Ozerov
Alexey,

This is completely unrelated activity.

On Wed, Nov 29, 2017 at 11:16 AM, Dmitriy Setrakyan <[hidden email]>
wrote:

> On Wed, Nov 29, 2017 at 12:14 AM, Alexey Kuznetsov <[hidden email]>
> wrote:
>
> >
> > If we are going to rework indexes, could we also think about supporting
> TTL
> > Indexes (like in Mongo DB [1])?
> >
>
> Awesome idea. Is there a ticket?
>
Reply | Threaded
Open this post in threaded view
|

Re: Rework storage format to index-organized approach

dsetrakyan
On Wed, Nov 29, 2017 at 12:33 AM, Vladimir Ozerov <[hidden email]>
wrote:

> Alexey,
>
> This is completely unrelated activity.
>

But still is a great idea :)