Hello Igniters,

The Ignite Spark SQL interface currently takes just a "table name" as a parameter, which it uses to supply a Spark dataset with data from the underlying Ignite SQL table of that name. To do this it loops through each cache and finds the first one with the given table name [1]. This causes issues if there are multiple tables registered in different caches with the same table name, as you can only access one of those caches from Spark. Is the right thing to do here:

1. Simply not support such a scenario and note in the Spark documentation that table names must be unique?
2. Pass an extra parameter through the Ignite Spark data source which optionally specifies the cache name?
3. Support namespacing in the existing table name parameter, i.e. "cacheName.tableName"?

Thanks,
Stuart.

[1] https://github.com/apache/ignite/blob/ca973ad99c6112160a305df05be9458e29f88307/modules/spark/src/main/scala/org/apache/ignite/spark/impl/package.scala#L119
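The ambiguity described above can be illustrated with a minimal, self-contained sketch of the first-match lookup (the names `CacheInfo` and `sqlCacheName` are hypothetical, for illustration only, and are not the actual Ignite internals):

```scala
// Hypothetical model of a cache holding SQL tables.
case class CacheInfo(cacheName: String, tables: Seq[String])

// Returns the first cache containing a table with the given name.
// If two caches both define "person", which one you get depends
// purely on iteration order.
def sqlCacheName(caches: Seq[CacheInfo], tabName: String): Option[String] =
  caches.find(_.tables.contains(tabName)).map(_.cacheName)

val caches = Seq(
  CacheInfo("employees", Seq("person")),
  CacheInfo("customers", Seq("person"))
)

// Both caches define "person"; only the first is ever visible to Spark.
sqlCacheName(caches, "person") // Some("employees")
```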
Stuart,

Two tables can have the same name only if they are located in different schemas. That said, adding schema name support makes sense to me for sure. We can implement this using either a separate SCHEMA_NAME parameter, or similar to what you suggested in option 3 but with the schema name instead of the cache name.

Please feel free to create a ticket.

-Val

On Tue, Aug 7, 2018 at 9:32 AM Stuart Macdonald <[hidden email]> wrote:
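From the Spark side, Val's two alternatives might look roughly like this (a hypothetical sketch only; the `"schema"` option key and the `"ignite"` format string are assumptions for illustration, not the actual API, and running it requires a live Ignite cluster):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ignite-schema-example").getOrCreate()

// Alternative A: a separate schema parameter on the data source.
val dfA = spark.read
  .format("ignite")
  .option("config", "ignite-config.xml")
  .option("table", "person")
  .option("schema", "employees") // disambiguates duplicate table names
  .load()

// Alternative B: namespacing in the existing table name parameter.
val dfB = spark.read
  .format("ignite")
  .option("config", "ignite-config.xml")
  .option("table", "employees.person") // schemaName.tableName
  .load()
```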
Thanks Val, here's the ticket:
https://issues.apache.org/jira/projects/IGNITE/issues/IGNITE-9228

(Thanks for correcting my terminology - I work mostly with the traditional CacheConfiguration interface, where I believe each cache occupies its own schema.)

Stuart.

On 7 Aug 2018, at 18:34, Valentin Kulichenko <[hidden email]> wrote:
Stuart, do you want to work on this ticket?

On Tue, 07/08/2018 at 11:13 -0700, Stuart Macdonald wrote:
Hi Nikolay, yes, would be happy to - will likely be early next week. I'll go with the approach of adding a new optional field to the Spark data source provider unless there are any objections.

Stuart.

On 9 Aug 2018, at 14:20, Nikolay Izhikov <[hidden email]> wrote:
Here's the initial pull request for this issue, please review and let me know your feedback. I had to combine the two approaches to enable this to work both for standard .read(), where we can add the schema option, and for catalog-based selects, where we use schemaName.tableName. Happy to discuss on a call if this isn't clear.

https://github.com/apache/ignite/pull/4551

On Thu, Aug 9, 2018 at 2:32 PM, Stuart Macdonald <[hidden email]> wrote:
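The combined approach described in the pull request message above could be sketched roughly as follows (hypothetical usage only; option keys, the `"ignite"` format string, and the table/schema names are illustrative assumptions, and the code needs a running Ignite cluster):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("combined-example").getOrCreate()

// Standard read path: an optional schema option on the data source.
val df = spark.read
  .format("ignite")
  .option("config", "ignite-config.xml")
  .option("table", "person")
  .option("schema", "employees")
  .load()
df.createOrReplaceTempView("employees_person")

// Catalog path: a schema-qualified table name in the SQL statement itself,
// since there is no per-table option hook when resolving via the catalog.
val fromCatalog = spark.sql("SELECT id, name FROM employees.person")
```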
Stuart, can you please move the ticket into PATCH_AVAILABLE state? You need to click the "Submit Patch" button in Jira.

D.

On Wed, Aug 15, 2018 at 10:22 AM, Stuart Macdonald <[hidden email]> wrote:
Hi Dmitriy, thanks - that’s done now,
Stuart.

On 16 Aug 2018, at 22:23, Dmitriy Setrakyan <[hidden email]> wrote:
Hi Stuart,
I see the review has already started and Nikolay has responded on GitHub. I've added you to the contributors list, so now you can assign issues to yourself. I also assigned https://issues.apache.org/jira/browse/IGNITE-9228 to you, so the issue can be correctly filtered by all committers. I hope you don't mind.

Sincerely,
Dmitriy Pavlov

On Fri, 17 Aug 2018 at 10:22, Stuart Macdonald <[hidden email]> wrote: