Table Names in Spark Catalog

Stuart Macdonald
Igniters,

While reviewing the changes for IGNITE-9228 [1,2], Nikolay and I are
discussing whether to introduce a change which may impact backwards
compatibility; Nikolay suggested we take the discussion to this list.

Ignite implements a custom Spark catalog that provides an API through which
Spark users can list the Ignite tables that can be queried via Spark SQL.
Currently that list includes just the table names, but IGNITE-9228
introduces a change that allows schema names to be optionally prefixed to
table names, to disambiguate tables with the same name in different
schemas. For the "list tables" API we therefore have two options:

1. List the tables using both their table names and schema-qualified table
names (e.g. [ "myTable", "mySchema.myTable" ]) even though they are the same
underlying table. This retains backwards compatibility with users who
expect "myTable" to appear in the catalog.
2. List the tables using only their schema-qualified names. This eliminates
duplication of names in the catalog but will potentially break
compatibility with users who expect the table name in the catalog.

With either option, Spark SQL SELECT statements will be able to use either
plain or schema-qualified table names; this change would purely impact the
API which is used to list available tables.
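
A minimal sketch of the two options in plain Python, assuming hypothetical function names and sample data; it is not the Ignite catalog code and only shows what each option would return from a "list tables" call:

```python
# Hypothetical sample data: (schema, table) pairs known to Ignite.
TABLES = [("mySchema", "myTable"), ("otherSchema", "myTable")]


def list_tables_option_1(tables):
    """Option 1: list both the plain and the schema-qualified name."""
    names = []
    for schema, table in tables:
        for name in (table, f"{schema}.{table}"):
            if name not in names:  # plain names may collide across schemas
                names.append(name)
    return names


def list_tables_option_2(tables):
    """Option 2: list only the schema-qualified name."""
    return [f"{schema}.{table}" for schema, table in tables]
```

With two schemas that both contain `myTable`, option 1 yields one ambiguous plain name alongside the two qualified names, whereas option 2 yields only the qualified names.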

Any opinions would be welcome.

Thanks,
Stuart.

[1] https://issues.apache.org/jira/browse/IGNITE-9228
[2] https://github.com/apache/ignite/pull/4551

Re: Table Names in Spark Catalog

Nikolay Izhikov-2
Hello, Stuart.

Personally, I think we should change the current table naming and return tables in the form `schema.table`.

Valentin, could you share your opinion?


On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:

> Igniters,
>
> While reviewing the changes for IGNITE-9228 [1,2], Nikolay and I are
> discussing whether to introduce a change which may impact backwards
> compatibility; Nikolay suggested we take the discussion to this list.
>
> Ignite implements a custom Spark catalog that provides an API through which
> Spark users can list the Ignite tables that can be queried via Spark SQL.
> Currently that list includes just the table names, but IGNITE-9228
> introduces a change that allows schema names to be optionally prefixed to
> table names, to disambiguate tables with the same name in different
> schemas. For the "list tables" API we therefore have two options:
>
> 1. List the tables using both their table names and schema-qualified table
> names (e.g. [ "myTable", "mySchema.myTable" ]) even though they are the same
> underlying table. This retains backwards compatibility with users who
> expect "myTable" to appear in the catalog.
> 2. List the tables using only their schema-qualified names. This eliminates
> duplication of names in the catalog but will potentially break
> compatibility with users who expect the table name in the catalog.
>
> With either option, Spark SQL SELECT statements will be able to use either
> plain or schema-qualified table names; this change would purely impact the
> API which is used to list available tables.
>
> Any opinions would be welcome.
>
> Thanks,
> Stuart.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9228
> [2] https://github.com/apache/ignite/pull/4551

Re: Table Names in Spark Catalog

Valentin Kulichenko
Stuart, Nikolay,

I see that the 'Table' class (returned by listTables method) has a
'database' field. Can we use this one to report schema name?

In any case, I think we should look into how this is done in data source
implementations for other databases. Any relational database has a notion
of schema, and I'm sure Spark integrations take this into account somehow.

-Val

On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <[hidden email]> wrote:

> Hello, Stuart.
>
> Personally, I think we should change the current table naming and return
> tables in the form `schema.table`.
>
> Valentin, could you share your opinion?

Re: Table Names in Spark Catalog

Nikolay Izhikov-2
Hello, Stuart.

Can you do some research and find out how schema is handled in Data Frames for a regular RDBMS such as Oracle, MySQL, etc?

On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:

> Stuart, Nikolay,
>
> I see that the 'Table' class (returned by listTables method) has a 'database' field. Can we use this one to report schema name?
>
> In any case, I think we should look into how this is done in data source implementations for other databases. Any relational database has a notion of schema, and I'm sure Spark integrations take this into account somehow.
>
> -Val

Re: Table Names in Spark Catalog

Stuart Macdonald
Nikolay, Val,

The JDBC Spark datasource[1] -- as far as I can tell -- has no
ExternalCatalog implementation; it just uses the database specified in the
JDBC URL. So I don't believe there is any way to call listTables() or
listDatabases() for the JDBC provider.

The Hive ExternalCatalog[2] makes the distinction between database and
table using the actual database and table mechanisms built into the
catalog, which is fine because Hive has the clear distinction and hierarchy
of databases and tables.

*However*, Ignite already uses the "database" concept in the Ignite
ExternalCatalog[3] to mean the name of an Ignite instance. So Ignite has
instances containing schemas containing tables, whereas Spark only has the
concepts of databases and tables. It seems we must either ignore one of the
three Ignite concepts or combine two of them into a Spark database or table.
The current implementation in the pull request combines the Ignite schema
and table attributes into the Spark table attribute.
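
As a rough illustration of that mismatch, the following plain-Python sketch folds the Ignite schema into the Spark table name. All type and function names here are hypothetical; this is not the code in the pull request.

```python
from typing import NamedTuple


class IgniteTable(NamedTuple):
    instance: str  # Ignite instance name
    schema: str    # Ignite schema name
    table: str     # Ignite table name


class SparkTable(NamedTuple):
    database: str  # Spark only has database + table
    name: str


def to_spark_table(t: IgniteTable) -> SparkTable:
    """Keep instance -> database, and fold the schema into the table name."""
    return SparkTable(database=t.instance, name=f"{t.schema}.{t.table}")
```

The three Ignite levels are squeezed into two Spark levels, so the schema only survives as a prefix of the table name.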

Stuart.

[1]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
[2]
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
[3]
https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala

On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <[hidden email]>
wrote:

> Hello, Stuart.
>
> Can you do some research and find out how schema is handled in Data Frames
> for a regular RDBMS such as Oracle, MySQL, etc?

Re: Table Names in Spark Catalog

Valentin Kulichenko
Stuart,

Thanks for pointing this out, I was not aware that we use Spark database
concept this way. Actually, this confuses me a lot. As far as I understand,
catalog is created in the scope of a particular IgniteSparkSession, which
in turn is assigned to a particular IgniteContext and therefore single
Ignite client. If that's the case, I don't think it should be aware of
other Ignite clients that are connected to other clusters. This doesn't
look like correct behavior to me, not to mention that with this approach
having multiple databases would be a very rare case. I believe we should
get rid of this logic and use Ignite schema name as database name in
Spark's catalog.
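
A small sketch of what this alternative could look like, in plain Python with a hypothetical class name and sample data (it does not use the Spark or Ignite APIs): the catalog is bound to one Ignite instance, and each schema is reported as a database.

```python
class SingleInstanceCatalog:
    """Hypothetical catalog bound to exactly one Ignite instance."""

    def __init__(self, tables_by_schema):
        # Mapping of schema name -> list of table names for this instance.
        self._tables_by_schema = dict(tables_by_schema)

    def list_databases(self):
        # Each Ignite schema is reported as a Spark "database".
        return sorted(self._tables_by_schema)

    def list_tables(self, database):
        # Tables are listed per schema, so plain names stay unambiguous.
        return sorted(self._tables_by_schema.get(database, []))
```

Under this assumption the instance name never appears in the catalog, because it is implied by the session that owns the catalog.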

Nikolay, what do you think?

-Val

On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <[hidden email]> wrote:

> Nikolay, Val,
>
> The JDBC Spark datasource[1] -- as far as I can tell -- has no
> ExternalCatalog implementation; it just uses the database specified in the
> JDBC URL. So I don't believe there is any way to call listTables() or
> listDatabases() for the JDBC provider.
>
> The Hive ExternalCatalog[2] makes the distinction between database and
> table using the actual database and table mechanisms built into the
> catalog, which is fine because Hive has the clear distinction and hierarchy
> of databases and tables.
>
> *However*, Ignite already uses the "database" concept in the Ignite
> ExternalCatalog[3] to mean the name of an Ignite instance. So Ignite has
> instances containing schemas containing tables, whereas Spark only has the
> concepts of databases and tables. It seems we must either ignore one of the
> three Ignite concepts or combine two of them into a Spark database or table.
> The current implementation in the pull request combines the Ignite schema
> and table attributes into the Spark table attribute.
>
> Stuart.
>
> [1]
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> [2]
>
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
> [3]
>
> https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala

Re: Table Names in Spark Catalog

Stuart Macdonald
Hi Val, yes that's correct. I'd be happy to make the change to have the
database reference the schema if Nikolay agrees. (I'll first need to do a
bit of research into how to obtain the list of all available schemata...)

Thanks,
Stuart.

On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <
[hidden email]> wrote:

> Stuart,
>
> Thanks for pointing this out, I was not aware that we use Spark database
> concept this way. Actually, this confuses me a lot. As far as I understand,
> catalog is created in the scope of a particular IgniteSparkSession, which
> in turn is assigned to a particular IgniteContext and therefore single
> Ignite client. If that's the case, I don't think it should be aware of
> other Ignite clients that are connected to other clusters. This doesn't
> look like correct behavior to me, not to mention that with this approach
> having multiple databases would be a very rare case. I believe we should
> get rid of this logic and use Ignite schema name as database name in
> Spark's catalog.
>
> Nikolay, what do you think?
>
> -Val

Re: Table Names in Spark Catalog

Nikolay Izhikov-2
Hello, Valentin.

> I believe we should get rid of this logic and use Ignite schema name as database name in Spark's catalog.

When I developed the Ignite integration with Spark Data Frames, I used the following abstraction described by Vladimir Ozerov:

"1) Let's consider Ignite cluster as a single database ("catalog" in ANSI SQL'92 terms)." [1]

Was I wrong? If so, let's fix it.

[1] http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html

On Wed, 22/08/2018 at 09:26 +0100, Stuart Macdonald wrote:

> Hi Val, yes that's correct. I'd be happy to make the change to have the
> database reference the schema if Nikolay agrees. (I'll first need to do a
> bit of research into how to obtain the list of all available schemata...)
>
> Thanks,
> Stuart.

Re: Table Names in Spark Catalog

Valentin Kulichenko
Nikolay,

Whatever we decide on would be right :) Basically, we need to answer this
question: does the catalog exist in the scope of a single IgniteSparkSession
(and therefore a single IgniteContext and a single Ignite instance)? In other
words, in the rare case where a single Spark application connects to
multiple Ignite clusters, would there be a catalog created per cluster?

If the answer is yes, the current logic doesn't make sense.

-Val



Re: Table Names in Spark Catalog

Nikolay Izhikov-2
Hello, Valentin.

> catalog exist in scope of a single IgniteSparkSession
> (and therefore single IgniteContext and single Ignite instance)?

Yes.
Actually, I was thinking about the use case where we have several Ignite configurations in one XML file.
Now I see that this may be too rare a use case to support.

Stuart, Valentin, what is your proposal?



Re: Table Names in Spark Catalog

Valentin Kulichenko
Nikolay,

If there are multiple configurations in the XML, IgniteContext will always use
only one of them. It looks like the current approach simply doesn't work. I
propose to report the schema name as 'database' in Spark. If there are multiple
clients, you would create multiple sessions and multiple catalogs.

Does that make sense?

-Val
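
For readers following the thread, the proposal above can be sketched in a few lines. This is plain Python for illustration only, not the Ignite or Spark API. It assumes a hypothetical `Table` record with `database` and `name` fields (loosely modelled on the 'Table' class with a 'database' field mentioned earlier in the thread) and a hypothetical `SessionCatalog` bound to one cluster; the schema and table names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical catalog entry: the Ignite schema is reported as `database`."""
    database: str
    name: str


class SessionCatalog:
    """Hypothetical catalog bound to a single session, and so a single cluster."""

    def __init__(self, ignite_tables):
        # ignite_tables: iterable of (schema, table) pairs known to this cluster.
        self._tables = [Table(schema, table) for schema, table in ignite_tables]

    def list_tables(self, database):
        # Return only the tables whose Ignite schema matches the Spark database.
        return [t for t in self._tables if t.database == database]


# Two schemas hold a table with the same name; the database field disambiguates.
catalog = SessionCatalog(
    [("PUBLIC", "person"), ("mySchema", "person"), ("mySchema", "city")]
)
public_tables = catalog.list_tables("PUBLIC")
my_schema_tables = catalog.list_tables("mySchema")
```

With this shape, neither of the two naming options from the start of the thread is needed for listing: the table name stays unqualified and the schema travels in the database field.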


Re: Table Names in Spark Catalog

Stuart Macdonald
I'll go ahead and make the changes to represent the schema name as the
database name for the purposes of the Spark catalog.

If anyone knows of an existing way to list all available schemata within an
Ignite instance please let me know, otherwise the first task will be
creating that mechanism.

Stuart.
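
One possible fallback, if no such mechanism exists, would be to derive the schema list from table metadata that is already available. The sketch below is plain Python under that assumption, not Ignite code; `known_tables` and `list_schemas` are hypothetical names, and this approach would miss any schema that contains no tables.

```python
def list_schemas(known_tables):
    """Return the distinct schema names in first-seen order.

    known_tables: iterable of (schema, table) pairs (hypothetical input).
    """
    seen = {}
    for schema, _table in known_tables:
        # A dict keeps insertion order, so this de-duplicates without sorting.
        seen.setdefault(schema, None)
    return list(seen)


known_tables = [("PUBLIC", "person"), ("mySchema", "person"), ("mySchema", "city")]
schemas = list_schemas(known_tables)
```

Each returned name would then be reported as one database in the Spark catalog.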

On Fri, Aug 24, 2018 at 6:23 PM Valentin Kulichenko <
[hidden email]> wrote:

> Nikolay,
>
> If there are multiple configuration in XML, IgniteContext will always use
> only one of them. Looks like current approach simply doesn't work. I
> propose to report schema name as 'database' in Spark. If there are multiple
> clients, you would create multiple sessions and multiple catalogs.
>
> Makes sense?
>
> -Val
>
> On Fri, Aug 24, 2018 at 12:33 AM Nikolay Izhikov <[hidden email]>
> wrote:
>
> > Hello, Valentin.
> >
> > > catalog exist in scope of a single IgniteSparkSession> (and therefore
> > single IgniteContext and single Ignite instance)?
> >
> > Yes.
> > Actually, I was thinking about use case when we have several Ignite
> > configuration in one XML file.
> > Now I see, may be this is too rare use-case to support.
> >
> > Stuart, Valentin, What is your proposal?
> >
> > В Ср, 22/08/2018 в 08:56 -0700, Valentin Kulichenko пишет:
> > > Nikolay,
> > >
> > > Whatever we decide on would be right :) Basically, we need to answer
> this
> > > question: does the catalog exist in scope of a single
> IgniteSparkSession
> > > (and therefore single IgniteContext and single Ignite instance)? In
> other
> > > words, in case of a rare use case when a single Spark application
> > connects
> > > to multiple Ignite clusters, would there be a catalog created per
> > cluster?
> > >
> > > If the answer is yes, current logic doesn't make sense.
> > >
> > > -Val
> > >
> > >
> > > On Wed, Aug 22, 2018 at 1:44 AM Nikolay Izhikov <[hidden email]>
> > wrote:
> > >
> > > > Hello, Valentin.
> > > >
> > > > > I believe we should get rid of this logic and use Ignite schema
> name
> > as
> > > >
> > > > database name in Spark's catalog.
> > > >
> > > > When I develop Ignite integration with Spark Data Frame I use
> following
> > > > abstraction described by Vladimir Ozerov:
> > > >
> > > > "1) Let's consider Ignite cluster as a single database ("catalog" in
> > ANSI
> > > > SQL'92 terms)." [1]
> > > >
> > > > Am I was wrong? If yes - let's fix it.
> > > >
> > > > [1]
> > > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html
> > > >
> > > > В Ср, 22/08/2018 в 09:26 +0100, Stuart Macdonald пишет:
> > > > > Hi Val, yes that's correct. I'd be happy to make the change to have
> > the
> > > > > database reference the schema if Nikolay agrees. (I'll first need
> to
> > do a
> > > > > bit of research into how to obtain the list of all available
> > schemata...)
> > > > >
> > > > > Thanks,
> > > > > Stuart.
> > > > >
> > > > > On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <
> > > > > [hidden email]> wrote:
> > > > >
> > > > > > Stuart,
> > > > > >
> > > > > > Thanks for pointing this out, I was not aware that we use Spark
> > > >
> > > > database
> > > > > > concept this way. Actually, this confuses me a lot. As far as I
> > > >
> > > > understand,
> > > > > > catalog is created in the scope of a particular
> IgniteSparkSession,
> > > >
> > > > which
> > > > > > in turn is assigned to a particular IgniteContext and therefore
> > single
> > > > > > Ignite client. If that's the case, I don't think it should be
> > aware of
> > > > > > other Ignite clients that are connected to other clusters. This
> > doesn't
> > > > > > look like correct behavior to me, not to mention that with this
> > > >
> > > > approach
> > > > > > having multiple databases would be a very rare case. I believe we
> > > >
> > > > should
> > > > > > get rid of this logic and use Ignite schema name as database name
> > in
> > > > > > Spark's catalog.
> > > > > >
> > > > > > Nikolay, what do you think?
> > > > > >
> > > > > > -Val
> > > > > >
> > > > > > On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Nikolay, Val,
> > > > > > >
> > > > > > > The JDBC Spark datasource[1] -- as far as I can tell -- has no
> > > > > > > ExternalCatalog implementation, it just uses the database
> > specified
> > > >
> > > > in the
> > > > > > > JDBC URL. So I don't believe there is any way to call
> > listTables() or
> > > > > > > listDatabases() for JDBC provider.
> > > > > > >
> > > > > > > The Hive ExternalCatalog[2] makes the distinction between
> > database
> > > >
> > > > and
> > > > > > > table using the actual database and table mechanisms built into
> > the
> > > > > > > catalog, which is fine because Hive has the clear distinction
> and
> > > > > > > hierarchy
> > > > > > > of databases and tables.
> > > > > > >
> > > > > > > *However* Ignite already uses the "database" concept in the
> > Ignite
> > > > > > >
> > > > > > > ExternalCatalog[3] to mean the name of an Ignite instance. So
> in
> > > >
> > > > Ignite we
> > > > > > > have instances containing schemas containing tables, and Spark
> > only
> > > >
> > > > has
> > > > > > > the
> > > > > > > concept of databases and tables so it seems like either we
> ignore
> > > >
> > > > one of
> > > > > > > the three Ignite concepts or combine two of them into database
> or
> > > >
> > > > table.
> > > > > > > The current implementation in the pull request combines Ignite
> > > >
> > > > schema and
> > > > > > > table attributes into the Spark table attribute.
> > > > > > >
> > > > > > > Stuart.
> > > > > > >
> > > > > > > [1]
> > > > > > > https://github.com/apache/spark/blob/master/sql/core/
> > > > > > > src/main/scala/org/apache/spark/sql/execution/
> > > > > > > datasources/jdbc/JDBCRelation.scala
> > > > > > > [2]
> > > > > > > https://github.com/apache/spark/blob/master/sql/hive/
> > > > > > >
> > src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
> > > > > > > [3]
> > > > > > > https://github.com/apache/ignite/blob/master/modules/
> > > > > > > spark/src/main/scala/org/apache/spark/sql/ignite/
> > > > > > > IgniteExternalCatalog.scala
> > > > > > >
> > > > > > > On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <
> > > >
> > > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hello, Stuart.
> > > > > > > >
> > > > > > > > Can you do some research and find out how schema is handled
> in
> > Data
> > > > > > >
> > > > > > > Frames
> > > > > > > > for a regular RDBMS such as Oracle, MySQL, etc?
> > > > > > > >
> > > > > > > > > On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:
> > > > > > > > > Stuart, Nikolay,
> > > > > > > > >
> > > > > > > > > I see that the 'Table' class (returned by listTables
> method)
> > has
> > > >
> > > > a
> > > > > > > >
> > > > > > > > 'database' field. Can we use this one to report schema name?
> > > > > > > > >
> > > > > > > > > In any case, I think we should look into how this is done
> in
> > data
> > > > > > >
> > > > > > > source
> > > > > > > > implementations for other databases. Any relational database
> > has a
> > > > > > >
> > > > > > > notion
> > > > > > > > of schema, and I'm sure Spark integrations take this into
> > account
> > > > > > >
> > > > > > > somehow.
> > > > > > > > >
> > > > > > > > > -Val
> > > > > > > > >
> > > > > > > > > On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <
> > > >
> > > > [hidden email]>
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > > > > Hello, Stuart.
> > > > > > > > > >
> > > > > > > > > > Personally, I think we should change current tables
> naming
> > and
> > > > > > >
> > > > > > > return
> > > > > > > > table in form of `schema.table`.
> > > > > > > > > >
> > > > > > > > > > Valentin, could you share your opinion?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:
> > > > > > > > > > > Igniters,
> > > > > > > > > > >
> > > > > > > > > > > While reviewing the changes for IGNITE-9228 [1,2],
> > Nikolay
> > > >
> > > > and I
> > > > > > >
> > > > > > > are
> > > > > > > > > > > discussing whether to introduce a change which may
> impact
> > > > > > >
> > > > > > > backwards
> > > > > > > > > > > compatibility; Nikolay suggested we take the discussion
> > to
> > > >
> > > > this
> > > > > > >
> > > > > > > list.
> > > > > > > > > > >
> > > > > > > > > > > Ignite implements a custom Spark catalog which provides
> > an
> > > >
> > > > API by
> > > > > > > >
> > > > > > > > which
> > > > > > > > > > > Spark users can list the tables which are available in
> > Ignite
> > > > > > >
> > > > > > > which
> > > > > > > > can be
> > > > > > > > > > > queried via Spark SQL. Currently that table name list
> > > >
> > > > includes
> > > > > > >
> > > > > > > just
> > > > > > > > the
> > > > > > > > > > > names of the tables, but IGNITE-9228 is introducing a
> > change
> > > >
> > > > which
> > > > > > > >
> > > > > > > > allows
> > > > > > > > > > > optional prefixing of schema names to table names to
> > > >
> > > > disambiguate
> > > > > > > >
> > > > > > > > multiple
> > > > > > > > > > > tables with the same name in different schemas. For the
> > "list
> > > > > > > >
> > > > > > > > tables" API
> > > > > > > > > > > we therefore have two options:
> > > > > > > > > > >
> > > > > > > > > > > 1. List the tables using both their table names and
> > > > > > >
> > > > > > > schema-qualified
> > > > > > > > table
> > > > > > > > > > > names (eg. [ "myTable", "mySchema.myTable" ]) even
> though
> > > >
> > > > they are
> > > > > > > >
> > > > > > > > the same
> > > > > > > > > > > underlying table. This retains backwards compatibility
> > with
> > > >
> > > > users
> > > > > > >
> > > > > > > who
> > > > > > > > > > > expect "myTable" to appear in the catalog.
> > > > > > > > > > > 2. List the tables using only their schema-qualified
> > names.
> > > >
> > > > This
> > > > > > > >
> > > > > > > > eliminates
> > > > > > > > > > > duplication of names in the catalog but will
> potentially
> > > >
> > > > break
> > > > > > > > > > > compatibility with users who expect the table name in
> the
> > > >
> > > > catalog.
> > > > > > > > > > >
> > > > > > > > > > > With either option we will allow for  Spark SQL SELECT
> > > >
> > > > statements
> > > > > > >
> > > > > > > to
> > > > > > > > use
> > > > > > > > > > > either table name or schema-qualified table names, this
> > > >
> > > > change
> > > > > > >
> > > > > > > would
> > > > > > > > purely
> > > > > > > > > > > impact the API which is used to list available tables.
> > > > > > > > > > >
> > > > > > > > > > > Any opinions would be welcome.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stuart.
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9228
> > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4551
>

Re: Table Names in Spark Catalog

Nikolay Izhikov-2
Igniters,

Personally, I don't like the solution with database == schema name.

1. I think we should try to use the right abstractions.
schema == database doesn't sound right to me.

Do you want to give all of our users an answer like this:

- "How can I change the Ignite SQL schema?"
- "This is obvious, just use setDatabase("MY_SCHEMA_NAME")".

2. I think that decision restricts the whole solution.
If Ignite supports multiple databases in the future, we simply won't have a place for them.

I think we should do the following:

        1. IgniteExternalCatalog should be able to return *ALL* tables within an Ignite instance.
        We shouldn't restrict the table list by schema by default.
        We should return tables with the schema name: `schema.table`

        2. We should introduce `OPTION_SCHEMA` for a dataframe to specify a schema.

        There is an issue with the second step: we can't use a schema name in a `CREATE TABLE` clause.
        This is a restriction of current Ignite SQL.

        I propose the following:

        1. For all write modes that require the creation of a table, we should disallow usage of a table outside of `SQL_PUBLIC`
        or usage of `OPTION_SCHEMA`. We should throw a proper exception for this case.

        2. Create a ticket to support `CREATE TABLE` with a custom schema name.

        3. After resolving the ticket from step 2, we can add full support for custom schemas to the Spark integration.

        4. We should throw an exception if a user tries to use setDatabase.

Does that make sense to you?
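
To make the proposal above concrete, here is a minimal Python sketch. It is not Ignite or Spark code: `TableRef`, `list_tables` and `check_table_creation` are hypothetical names invented for this illustration. The only details taken from the message are the `schema.table` listing format, the `SQL_PUBLIC` schema, and the rule that table-creating writes should fail with an exception. The sketch reads that rule as rejecting both a table name qualified with another schema and any use of the schema option, which is one possible interpretation.

```python
from dataclasses import dataclass

# Schema name as written in the message above (an assumption of this sketch).
DEFAULT_SCHEMA = "SQL_PUBLIC"


@dataclass(frozen=True)
class TableRef:
    """Hypothetical stand-in for one table known to an Ignite instance."""
    schema: str
    table: str

    @property
    def qualified_name(self) -> str:
        return f"{self.schema}.{self.table}"


def list_tables(tables):
    """Return every table as `schema.table`, with no filtering by schema."""
    return sorted(t.qualified_name for t in tables)


def check_table_creation(table_name, option_schema=None):
    """Raise if a table-creating write targets anything but the default schema."""
    prefix, dot, _ = table_name.partition(".")
    name_schema = prefix if dot else None
    if name_schema is not None and name_schema != DEFAULT_SCHEMA:
        raise ValueError(
            f"cannot create table in schema {name_schema!r}: "
            "CREATE TABLE does not accept a custom schema name"
        )
    if option_schema is not None:
        raise ValueError(
            "the schema option cannot be combined with a write mode "
            "that creates the table"
        )
```

With two tables named `person` in different schemas, `list_tables` would report both under distinct qualified names, while `check_table_creation("mySchema.person")` would raise.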



Re: Table Names in Spark Catalog

Valentin Kulichenko
Hi Nikolay,

I think it's actually pretty unfortunate that Spark uses the term "database"
here, as it essentially refers to a schema in my view. Usually, a database is
something you create a physical connection to, and the connection is bound to
that database. To connect to another database you need to create a new
connection. In Spark, however, you can switch between "databases" within a
single session, which looks really weird to me because that's usually a
characteristic of a schema. Having said that, I understand your concern,
but I don't think there is an ideal solution.

As for your approach, I still don't understand how it would fully
support schemas in the catalog.
- How will you get a list of tables within a particular schema? In other
words, what would the listTables() method return?
- How will you switch between the schemas?
- Etc.

I still think assuming database=schema is the best we can do here, but I
would be happy to hear other opinions from other community members.
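
For comparison, the database=schema mapping can be sketched in a few lines of Python. This is a toy model, not the Ignite catalog implementation: the class `SchemaAsDatabaseCatalog` and the schema names `PUBLIC` and `mySchema` are assumptions made for illustration, and the method names only loosely mirror the listing and switching operations discussed in this thread.

```python
class SchemaAsDatabaseCatalog:
    """Toy catalog in which each Spark "database" is one Ignite schema."""

    def __init__(self, tables_by_schema, current_database):
        # Keep a sorted copy of the table names for each schema.
        self._tables = {s: sorted(t) for s, t in tables_by_schema.items()}
        self.set_current_database(current_database)

    def list_databases(self):
        # Every schema is reported as a database.
        return sorted(self._tables)

    def set_current_database(self, name):
        # Switching the "database" is switching the schema.
        if name not in self._tables:
            raise KeyError(f"unknown schema: {name}")
        self._current = name

    def list_tables(self, database=None):
        # Tables are listed per schema, so unqualified names stay unambiguous.
        return list(self._tables[database or self._current])
```

Under this model both open questions have direct answers: listing tables is scoped to one schema, and switching schemas is the same operation as switching the current database.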

OPTION_SCHEMA should definitely be introduced though (I thought we already
did, no?). CREATE TABLE will be supported with this ticket:
https://issues.apache.org/jira/browse/IGNITE-5780. For now we will have to
throw an exception if a custom schema name is provided when creating a Spark
session but the table does not exist yet.

-Val

On Sun, Aug 26, 2018 at 7:56 AM Nikolay Izhikov <[hidden email]> wrote:

> Igniters,
>
> Personally, I don't like the solution with database == schema name.
>
> 1. I think we should try to use the right abstractions.
> schema == database doesn't sound right for me.
>
> Do you want to answer to all of our users something like that:
>
> - "How I can change Ignite SQL schema?"
> - "This is obvious, just use setDatabase("MY_SCHEMA_NAME")".
>
> 2. I think we restrict whole solution with that decision.
> If Ignite will support multiple databases in the future we just don't have
> a place for it.
>
> I think we should do the following:
>
>         1. IgniteExternalCatalog should be able to return *ALL* tables
> within Ignite instance.
>         We shouldn't restrict tables list by schema by default.
>         We should return tables with schema name - `schema.table`
>
>         2. We should introduce `OPTION_SCHEMA` for a dataframe to specify
> a schema.
>
>         There is an issue with the second step: We can't use schema name
> in `CREATE TABLE` clause.
>         This is restriction of current Ignite SQL.
>
>         I propose to make the following:
>
>         1. For all write modes that requires the creation of table we
> should disallow usage of table outside of `SQL_PUBLIC`
>         or usage of `OPTION_SCHEMA`. We should throw proper exception for
> this case.
>
>         2. Create a ticket to support `CREATE TABLE` with custom schema
> name.
>
>         3. After resolving ticket from step 2 we can add full support of
> custom schema to Spark integration.
>
>         4. We should throw an exception if user try to use setDatabase.
>
> Is that makes sense for you?
>
> On Sun, 26/08/2018 at 14:09 +0100, Stuart Macdonald wrote:
> > I'll go ahead and make the changes to represent the schema name as the
> > database name for the purposes of the Spark catalog.
> >
> > If anyone knows of an existing way to list all available schemata within
> an
> > Ignite instance please let me know, otherwise the first task will be
> > creating that mechanism.
> >
> > Stuart.
> >
> > On Fri, Aug 24, 2018 at 6:23 PM Valentin Kulichenko <
> > [hidden email]> wrote:
> >
> > > Nikolay,
> > >
> > > If there are multiple configuration in XML, IgniteContext will always
> use
> > > only one of them. Looks like current approach simply doesn't work. I
> > > propose to report schema name as 'database' in Spark. If there are
> multiple
> > > clients, you would create multiple sessions and multiple catalogs.
> > >
> > > Makes sense?
> > >
> > > -Val
> > >
> > > On Fri, Aug 24, 2018 at 12:33 AM Nikolay Izhikov <[hidden email]>
> > > wrote:
> > >
> > > > Hello, Valentin.
> > > >
> > > > > catalog exist in scope of a single IgniteSparkSession> (and
> therefore
> > > >
> > > > single IgniteContext and single Ignite instance)?
> > > >
> > > > Yes.
> > > > Actually, I was thinking about use case when we have several Ignite
> > > > configuration in one XML file.
> > > > Now I see, may be this is too rare use-case to support.
> > > >
> > > > Stuart, Valentin, What is your proposal?
> > > >
> > > > On Wed, 22/08/2018 at 08:56 -0700, Valentin Kulichenko wrote:
> > > > > Nikolay,
> > > > >
> > > > > Whatever we decide on would be right :) Basically, we need to
> answer
> > >
> > > this
> > > > > question: does the catalog exist in scope of a single
> > >
> > > IgniteSparkSession
> > > > > (and therefore single IgniteContext and single Ignite instance)? In
> > >
> > > other
> > > > > words, in case of a rare use case when a single Spark application
> > > >
> > > > connects
> > > > > to multiple Ignite clusters, would there be a catalog created per
> > > >
> > > > cluster?
> > > > >
> > > > > If the answer is yes, current logic doesn't make sense.
> > > > >
> > > > > -Val
> > > > >
> > > > >
> > > > > On Wed, Aug 22, 2018 at 1:44 AM Nikolay Izhikov <
> [hidden email]>
> > > >
> > > > wrote:
> > > > >
> > > > > > Hello, Valentin.
> > > > > >
> > > > > > > I believe we should get rid of this logic and use Ignite schema
> > >
> > > name
> > > > as
> > > > > >
> > > > > > database name in Spark's catalog.
> > > > > >
> > > > > > When I develop Ignite integration with Spark Data Frame I use
> > >
> > > following
> > > > > > abstraction described by Vladimir Ozerov:
> > > > > >
> > > > > > "1) Let's consider Ignite cluster as a single database
> ("catalog" in
> > > >
> > > > ANSI
> > > > > > SQL'92 terms)." [1]
> > > > > >
> > > > > > Am I was wrong? If yes - let's fix it.
> > > > > >
> > > > > > [1]
> > > > > >
> > >
> > >
> http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html
> > > > > >
> > > > > > On Wed, 22/08/2018 at 09:26 +0100, Stuart Macdonald wrote:
> > > > > > > Hi Val, yes that's correct. I'd be happy to make the change to
> have
> > > >
> > > > the
> > > > > > > database reference the schema if Nikolay agrees. (I'll first
> need
> > >
> > > to
> > > > do a
> > > > > > > bit of research into how to obtain the list of all available
> > > >
> > > > schemata...)
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Stuart.
> > > > > > >
> > > > > > > On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <
> > > > > > > [hidden email]> wrote:
> > > > > > >
> > > > > > > > Stuart,
> > > > > > > >
> > > > > > > > Thanks for pointing this out, I was not aware that we use
> Spark
> > > > > >
> > > > > > database
> > > > > > > > concept this way. Actually, this confuses me a lot. As far
> as I
> > > > > >
> > > > > > understand,
> > > > > > > > catalog is created in the scope of a particular
> > >
> > > IgniteSparkSession,
> > > > > >
> > > > > > which
> > > > > > > > in turn is assigned to a particular IgniteContext and
> therefore
> > > >
> > > > single
> > > > > > > > Ignite client. If that's the case, I don't think it should be
> > > >
> > > > aware of
> > > > > > > > other Ignite clients that are connected to other clusters.
> This
> > > >
> > > > doesn't
> > > > > > > > look like correct behavior to me, not to mention that with
> this
> > > > > >
> > > > > > approach
> > > > > > > > having multiple databases would be a very rare case. I
> believe we
> > > > > >
> > > > > > should
> > > > > > > > get rid of this logic and use Ignite schema name as database
> name
> > > >
> > > > in
> > > > > > > > Spark's catalog.
> > > > > > > >
> > > > > > > > Nikolay, what do you think?
> > > > > > > >
> > > > > > > > -Val
> > > > > > > >
> > > > > > > > On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <
> > > >
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Nikolay, Val,
> > > > > > > > >
> > > > > > > > > The JDBC Spark datasource[1] -- as far as I can tell --
> has no
> > > > > > > > > ExternalCatalog implementation, it just uses the database
> > > >
> > > > specified
> > > > > >
> > > > > > in the
> > > > > > > > > JDBC URL. So I don't believe there is any way to call
> > > >
> > > > listTables() or
> > > > > > > > > listDatabases() for JDBC provider.
> > > > > > > > >
> > > > > > > > > The Hive ExternalCatalog[2] makes the distinction between
> > > >
> > > > database
> > > > > >
> > > > > > and
> > > > > > > > > table using the actual database and table mechanisms built
> into
> > > >
> > > > the
> > > > > > > > > catalog, which is fine because Hive has the clear
> distinction
> > >
> > > and
> > > > > > > > > hierarchy
> > > > > > > > > of databases and tables.
> > > > > > > > >
> > > > > > > > > *However* Ignite already uses the "database" concept in the
> > > >
> > > > Ignite
> > > > > > > > >
> > > > > > > > > ExternalCatalog[3] to mean the name of an Ignite instance.
> So
> > >
> > > in
> > > > > >
> > > > > > Ignite we
> > > > > > > > > have instances containing schemas containing tables, and
> Spark
> > > >
> > > > only
> > > > > >
> > > > > > has
> > > > > > > > > the
> > > > > > > > > concept of databases and tables so it seems like either we
> > >
> > > ignore
> > > > > >
> > > > > > one of
> > > > > > > > > the three Ignite concepts or combine two of them into
> database
> > >
> > > or
> > > > > >
> > > > > > table.
> > > > > > > > > The current implementation in the pull request combines
> Ignite
> > > > > >
> > > > > > schema and
> > > > > > > > > table attributes into the Spark table attribute.
> > > > > > > > >
> > > > > > > > > Stuart.
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > > https://github.com/apache/spark/blob/master/sql/core/
> > > > > > > > > src/main/scala/org/apache/spark/sql/execution/
> > > > > > > > > datasources/jdbc/JDBCRelation.scala
> > > > > > > > > [2]
> > > > > > > > > https://github.com/apache/spark/blob/master/sql/hive/
> > > > > > > > >
> > > >
> > > > src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
> > > > > > > > > [3]
> > > > > > > > > https://github.com/apache/ignite/blob/master/modules/
> > > > > > > > > spark/src/main/scala/org/apache/spark/sql/ignite/
> > > > > > > > > IgniteExternalCatalog.scala
> > > > > > > > >
> > > > > > > > > On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <
> > > > > >
> > > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hello, Stuart.
> > > > > > > > > >
> > > > > > > > > > Can you do some research and find out how schema is
> handled
> > >
> > > in
> > > > Data
> > > > > > > > >
> > > > > > > > > Frames
> > > > > > > > > > for a regular RDBMS such as Oracle, MySQL, etc?
> > > > > > > > > >
> > > > > > > > > > On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:
> > > > > > > > > > > Stuart, Nikolay,
> > > > > > > > > > >
> > > > > > > > > > > I see that the 'Table' class (returned by listTables
> > >
> > > method)
> > > > has
> > > > > >
> > > > > > a
> > > > > > > > > >
> > > > > > > > > > 'database' field. Can we use this one to report schema
> name?
> > > > > > > > > > >
> > > > > > > > > > > In any case, I think we should look into how this is
> done
> > >
> > > in
> > > > data
> > > > > > > > >
> > > > > > > > > source
> > > > > > > > > > implementations for other databases. Any relational
> database
> > > >
> > > > has a
> > > > > > > > >
> > > > > > > > > notion
> > > > > > > > > > of schema, and I'm sure Spark integrations take this into
> > > >
> > > > account
> > > > > > > > >
> > > > > > > > > somehow.
> > > > > > > > > > >
> > > > > > > > > > > -Val
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <
> > > > > >
> > > > > > [hidden email]>
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > Hello, Stuart.
> > > > > > > > > > > >
> > > > > > > > > > > > Personally, I think we should change current tables
> > >
> > > naming
> > > > and
> > > > > > > > >
> > > > > > > > > return
> > > > > > > > > > table in form of `schema.table`.
> > > > > > > > > > > >
> > > > > > > > > > > > Valentin, could you share your opinion?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:
> > > > > > > > > > > > > Igniters,
> > > > > > > > > > > > >
> > > > > > > > > > > > > While reviewing the changes for IGNITE-9228 [1,2],
> > > >
> > > > Nikolay
> > > > > >
> > > > > > and I
> > > > > > > > >
> > > > > > > > > are
> > > > > > > > > > > > > discussing whether to introduce a change which may
> > >
> > > impact
> > > > > > > > >
> > > > > > > > > backwards
> > > > > > > > > > > > > compatibility; Nikolay suggested we take the
> discussion
> > > >
> > > > to
> > > > > >
> > > > > > this
> > > > > > > > >
> > > > > > > > > list.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ignite implements a custom Spark catalog which
> provides
> > > >
> > > > an
> > > > > >
> > > > > > API by
> > > > > > > > > >
> > > > > > > > > > which
> > > > > > > > > > > > > Spark users can list the tables which are
> available in
> > > >
> > > > Ignite
> > > > > > > > >
> > > > > > > > > which
> > > > > > > > > > can be
> > > > > > > > > > > > > queried via Spark SQL. Currently that table name
> list
> > > > > >
> > > > > > includes
> > > > > > > > >
> > > > > > > > > just
> > > > > > > > > > the
> > > > > > > > > > > > > names of the tables, but IGNITE-9228 is
> introducing a
> > > >
> > > > change
> > > > > >
> > > > > > which
> > > > > > > > > >
> > > > > > > > > > allows
> > > > > > > > > > > > > optional prefixing of schema names to table names
> to
> > > > > >
> > > > > > disambiguate
> > > > > > > > > >
> > > > > > > > > > multiple
> > > > > > > > > > > > > tables with the same name in different schemas.
> For the
> > > >
> > > > "list
> > > > > > > > > >
> > > > > > > > > > tables" API
> > > > > > > > > > > > > we therefore have two options:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. List the tables using both their table names and
> > > > > > > > >
> > > > > > > > > schema-qualified
> > > > > > > > > > table
> > > > > > > > > > > > > names (eg. [ "myTable", "mySchema.myTable" ]) even
> > >
> > > though
> > > > > >
> > > > > > they are
> > > > > > > > > >
> > > > > > > > > > the same
> > > > > > > > > > > > > underlying table. This retains backwards
> compatibility
> > > >
> > > > with
> > > > > >
> > > > > > users
> > > > > > > > >
> > > > > > > > > who
> > > > > > > > > > > > > expect "myTable" to appear in the catalog.
> > > > > > > > > > > > > 2. List the tables using only their
> schema-qualified
> > > >
> > > > names.
> > > > > >
> > > > > > This
> > > > > > > > > >
> > > > > > > > > > eliminates
> > > > > > > > > > > > > duplication of names in the catalog but will
> > >
> > > potentially
> > > > > >
> > > > > > break
> > > > > > > > > > > > > compatibility with users who expect the table name
> in
> > >
> > > the
> > > > > >
> > > > > > catalog.
> > > > > > > > > > > > >
> > > > > > > > > > > > > With either option we will allow for  Spark SQL
> SELECT
> > > > > >
> > > > > > statements
> > > > > > > > >
> > > > > > > > > to
> > > > > > > > > > use
> > > > > > > > > > > > > either table name or schema-qualified table names,
> this
> > > > > >
> > > > > > change
> > > > > > > > >
> > > > > > > > > would
> > > > > > > > > > purely
> > > > > > > > > > > > > impact the API which is used to list available
> tables.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Any opinions would be welcome.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Stuart.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> https://issues.apache.org/jira/browse/IGNITE-9228
> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4551

Re: Table Names in Spark Catalog

Stuart Macdonald
Nikolay, Val, it would be good if we could reach agreement here so that I
can make the necessary modifications before the 2.7 cutoff.

Nikolay - would you be comfortable if I went ahead and made database=schema?

Stuart.
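
For readers following the thread, here is a minimal sketch of what "database=schema" could mean for the catalog's listing calls. It is only a toy model under stated assumptions: the class `SchemaAsDatabaseCatalog`, its method names, and the default schema name `PUBLIC` are hypothetical and are not the Ignite or Spark API. It shows each schema being reported as a database, tables being listed per database, and both plain and schema-qualified names resolving to the same table.

```python
from collections import defaultdict


class SchemaAsDatabaseCatalog:
    """Toy catalog that reports each SQL schema as a Spark-style 'database'.

    Hypothetical illustration only; not the IgniteExternalCatalog implementation.
    """

    def __init__(self, tables, default_schema="PUBLIC"):
        # tables: iterable of (schema, table) pairs standing in for cluster metadata.
        self._by_schema = defaultdict(set)
        for schema, table in tables:
            self._by_schema[schema].add(table)
        # Assumption: unqualified names resolve against a current schema.
        self._current = default_schema

    def list_databases(self):
        # One "database" per schema.
        return sorted(self._by_schema)

    def set_current_database(self, name):
        if name not in self._by_schema:
            raise KeyError(f"unknown schema: {name}")
        self._current = name

    def list_tables(self, database=None):
        # Only the tables of one schema are listed, so names are never duplicated.
        db = self._current if database is None else database
        return sorted(self._by_schema.get(db, ()))

    def resolve(self, name):
        # Accept both "myTable" and "mySchema.myTable".
        schema, _, table = name.rpartition(".")
        schema = schema or self._current
        if table not in self._by_schema.get(schema, ()):
            raise KeyError(f"unknown table: {schema}.{table}")
        return schema, table


catalog = SchemaAsDatabaseCatalog(
    [("PUBLIC", "person"), ("mySchema", "myTable"), ("otherSchema", "myTable")]
)
print(catalog.list_databases())
print(catalog.list_tables("mySchema"))
print(catalog.resolve("mySchema.myTable"))
```

With this shape, two tables that share a name in different schemas stay distinguishable through the database field rather than through a prefixed table name.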


Re: Table Names in Spark Catalog

Nikolay Izhikov-2
Hello, Stuart.

Sorry for the silence.

I was swamped the last couple of days.

I think you can go ahead and implement the suggested solution.
I'm -0 on it.
So I won't block it, but I'm still not happy with the abstractions :).

В Пн, 03/09/2018 в 09:35 +0100, Stuart Macdonald пишет:

> Nikolay, Val, it would be good if we could reach agreement here so that I
> can make the necessary modifications before the 2.7 cutoff.
>
> Nikolay - would you be comfortable if I went ahead and made database=schema?
>
> Stuart.
>
> On Mon, Aug 27, 2018 at 10:22 PM Valentin Kulichenko <
> [hidden email]> wrote:
>
> > Hi Nikolay,
> >
> > I think it's actually pretty unfortunate that Spark uses term "database"
> > here, as it essentially refers to a schema in my view. Usually, database is
> > something you create a physical connection to, and connection is bind to
> > that database. To connect to another database you need to create a new
> > connection. In Spark, however, you can switch between "databases" within a
> > single session, which looks really weird to me because it's usually a
> > characteristic of a schema. Having said that, I understand your concern,
> > but I don't think there is an ideal solution.
> >
> > As for your approach, I still don't understand how it will allow to fully
> > support schemas in catalog.
> > - How will you get a list of tables within a particular schema? In other
> > words, what would listTables() method return?
> > - How will you switch between the schemas?
> > - Etc.
> >
> > I still think assuming database=schema is the best we can do here, but I
> > would be happy to hear another opinions from other community members.
> >
> > OPTION_SCHEMA should definitely be introduced though (I thought we already
> > did, no?). CREATE TABLE will be supported with this ticket:
> > https://issues.apache.org/jira/browse/IGNITE-5780. For now we will have to
> > throw an exception if custom schema name is provided when creating a Spark
> > session, but table does not exist yet.
> >
> > -Val
> >
> > On Sun, Aug 26, 2018 at 7:56 AM Nikolay Izhikov <[hidden email]>
> > wrote:
> >
> > > Igniters,
> > >
> > > Personally, I don't like the solution with database == schema name.
> > >
> > > 1. I think we should try to use the right abstractions.
> > > schema == database doesn't sound right for me.
> > >
> > > Do you want to answer to all of our users something like that:
> > >
> > > - "How I can change Ignite SQL schema?"
> > > - "This is obvious, just use setDatabase("MY_SCHEMA_NAME")".
> > >
> > > 2. I think we restrict whole solution with that decision.
> > > If Ignite will support multiple databases in the future we just don't
> >
> > have
> > > a place for it.
> > >
> > > I think we should do the following:
> > >
> > >         1. IgniteExternalCatalog should be able to return *ALL* tables
> > > within Ignite instance.
> > >         We shouldn't restrict tables list by schema by default.
> > >         We should return tables with schema name - `schema.table`
> > >
> > >         2. We should introduce `OPTION_SCHEMA` for a dataframe to specify
> > > a schema.
> > >
> > >         There is an issue with the second step: We can't use schema name
> > > in `CREATE TABLE` clause.
> > >         This is restriction of current Ignite SQL.
> > >
> > >         I propose to make the following:
> > >
> > >         1. For all write modes that requires the creation of table we
> > > should disallow usage of table outside of `SQL_PUBLIC`
> > >         or usage of `OPTION_SCHEMA`. We should throw proper exception for
> > > this case.
> > >
> > >         2. Create a ticket to support `CREATE TABLE` with custom schema
> > > name.
> > >
> > >         3. After resolving ticket from step 2 we can add full support of
> > > custom schema to Spark integration.
> > >
> > >         4. We should throw an exception if user try to use setDatabase.
> > >
> > > Is that makes sense for you?
> > >
> > > В Вс, 26/08/2018 в 14:09 +0100, Stuart Macdonald пишет:
> > > > I'll go ahead and make the changes to represent the schema name as the
> > > > database name for the purposes of the Spark catalog.
> > > >
> > > > If anyone knows of an existing way to list all available schemata
> >
> > within
> > > an
> > > > Ignite instance please let me know, otherwise the first task will be
> > > > creating that mechanism.
> > > >
> > > > Stuart.
> > > >
> > > > On Fri, Aug 24, 2018 at 6:23 PM Valentin Kulichenko <
> > > > [hidden email]> wrote:
> > > >
> > > > > Nikolay,
> > > > >
> > > > > If there are multiple configuration in XML, IgniteContext will always
> > >
> > > use
> > > > > only one of them. Looks like current approach simply doesn't work. I
> > > > > propose to report schema name as 'database' in Spark. If there are
> > >
> > > multiple
> > > > > clients, you would create multiple sessions and multiple catalogs.
> > > > >
> > > > > Makes sense?
> > > > >
> > > > > -Val
> > > > >
> > > > > On Fri, Aug 24, 2018 at 12:33 AM Nikolay Izhikov <
> >
> > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Hello, Valentin.
> > > > > >
> > > > > > > catalog exist in scope of a single IgniteSparkSession> (and
> > >
> > > therefore
> > > > > >
> > > > > > single IgniteContext and single Ignite instance)?
> > > > > >
> > > > > > Yes.
> > > > > > Actually, I was thinking about use case when we have several Ignite
> > > > > > configuration in one XML file.
> > > > > > Now I see, may be this is too rare use-case to support.
> > > > > >
> > > > > > Stuart, Valentin, What is your proposal?
> > > > > >
> > > > > > В Ср, 22/08/2018 в 08:56 -0700, Valentin Kulichenko пишет:
> > > > > > > Nikolay,
> > > > > > >
> > > > > > > Whatever we decide on would be right :) Basically, we need to
> > >
> > > answer
> > > > >
> > > > > this
> > > > > > > question: does the catalog exist in scope of a single
> > > > >
> > > > > IgniteSparkSession
> > > > > > > (and therefore single IgniteContext and single Ignite instance)?
> >
> > In
> > > > >
> > > > > other
> > > > > > > words, in case of a rare use case when a single Spark application
> > > > > >
> > > > > > connects
> > > > > > > to multiple Ignite clusters, would there be a catalog created per
> > > > > >
> > > > > > cluster?
> > > > > > >
> > > > > > > If the answer is yes, current logic doesn't make sense.
> > > > > > >
> > > > > > > -Val
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 22, 2018 at 1:44 AM Nikolay Izhikov <
> > >
> > > [hidden email]>
> > > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hello, Valentin.
> > > > > > > >
> > > > > > > > > I believe we should get rid of this logic and use Ignite
> >
> > schema
> > > > >
> > > > > name
> > > > > > as
> > > > > > > >
> > > > > > > > database name in Spark's catalog.
> > > > > > > >
> > > > > > > > When I develop Ignite integration with Spark Data Frame I use
> > > > >
> > > > > following
> > > > > > > > abstraction described by Vladimir Ozerov:
> > > > > > > >
> > > > > > > > "1) Let's consider Ignite cluster as a single database
> > >
> > > ("catalog" in
> > > > > >
> > > > > > ANSI
> > > > > > > > SQL'92 terms)." [1]
> > > > > > > >
> > > > > > > > Am I was wrong? If yes - let's fix it.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > >
> > > > >
> >
> > http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html
> > > > > > > >
> > > > > > > > В Ср, 22/08/2018 в 09:26 +0100, Stuart Macdonald пишет:
> > > > > > > > > Hi Val, yes that's correct. I'd be happy to make the change
> >
> > to
> > > have
> > > > > >
> > > > > > the
> > > > > > > > > database reference the schema if Nikolay agrees. (I'll first
> > >
> > > need
> > > > >
> > > > > to
> > > > > > do a
> > > > > > > > > bit of research into how to obtain the list of all available
> > > > > >
> > > > > > schemata...)
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Stuart.
> > > > > > > > >
> > > > > > > > > On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <
> > > > > > > > > [hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > > Stuart,
> > > > > > > > > >
> > > > > > > > > > Thanks for pointing this out, I was not aware that we use
> > >
> > > Spark
> > > > > > > >
> > > > > > > > database
> > > > > > > > > > concept this way. Actually, this confuses me a lot. As far
> > >
> > > as I
> > > > > > > >
> > > > > > > > understand,
> > > > > > > > > > catalog is created in the scope of a particular
> > > > >
> > > > > IgniteSparkSession,
> > > > > > > >
> > > > > > > > which
> > > > > > > > > > in turn is assigned to a particular IgniteContext and
> > >
> > > therefore
> > > > > >
> > > > > > single
> > > > > > > > > > Ignite client. If that's the case, I don't think it should
> >
> > be
> > > > > >
> > > > > > aware of
> > > > > > > > > > other Ignite clients that are connected to other clusters.
> > >
> > > This
> > > > > >
> > > > > > doesn't
> > > > > > > > > > look like correct behavior to me, not to mention that with
> > >
> > > this
> > > > > > > >
> > > > > > > > approach
> > > > > > > > > > having multiple databases would be a very rare case. I
> > >
> > > believe we
> > > > > > > >
> > > > > > > > should
> > > > > > > > > > get rid of this logic and use Ignite schema name as
> >
> > database
> > > name
> > > > > >
> > > > > > in
> > > > > > > > > > Spark's catalog.
> > > > > > > > > >
> > > > > > > > > > Nikolay, what do you think?
> > > > > > > > > >
> > > > > > > > > > -Val
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <
> > > > > >
> > > > > > [hidden email]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Nikolay, Val,
> > > > > > > > > > >
> > > > > > > > > > > The JDBC Spark datasource[1] -- as far as I can tell --
> > >
> > > has no
> > > > > > > > > > > ExternalCatalog implementation, it just uses the database
> > > > > >
> > > > > > specified
> > > > > > > >
> > > > > > > > in the
> > > > > > > > > > > JDBC URL. So I don't believe there is any way to call
> > > > > >
> > > > > > listTables() or
> > > > > > > > > > > listDatabases() for JDBC provider.
> > > > > > > > > > >
> > > > > > > > > > > The Hive ExternalCatalog[2] makes the distinction between
> > > > > >
> > > > > > database
> > > > > > > >
> > > > > > > > and
> > > > > > > > > > > table using the actual database and table mechanisms
> >
> > built
> > > into
> > > > > >
> > > > > > the
> > > > > > > > > > > catalog, which is fine because Hive has the clear
> > >
> > > distinction
> > > > >
> > > > > and
> > > > > > > > > > > hierarchy
> > > > > > > > > > > of databases and tables.
> > > > > > > > > > >
> > > > > > > > > > > *However* Ignite already uses the "database" concept in
> >
> > the
> > > > > >
> > > > > > Ignite
> > > > > > > > > > >
> > > > > > > > > > > ExternalCatalog[3] to mean the name of an Ignite
> >
> > instance.
> > > So
> > > > >
> > > > > in
> > > > > > > >
> > > > > > > > Ignite we
> > > > > > > > > > > have instances containing schemas containing tables, and
> > >
> > > Spark
> > > > > >
> > > > > > only
> > > > > > > >
> > > > > > > > has
> > > > > > > > > > > the
> > > > > > > > > > > concept of databases and tables so it seems like either
> >
> > we
> > > > >
> > > > > ignore
> > > > > > > >
> > > > > > > > one of
> > > > > > > > > > > the three Ignite concepts or combine two of them into
> > >
> > > database
> > > > >
> > > > > or
> > > > > > > >
> > > > > > > > table.
> > > > > > > > > > > The current implementation in the pull request combines
> > >
> > > Ignite
> > > > > > > >
> > > > > > > > schema and
> > > > > > > > > > > table attributes into the Spark table attribute.
> > > > > > > > > > >
> > > > > > > > > > > Stuart.
> > > > > > > > > > >
> > > > > > > > > > > > [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> > > > > > > > > > > > [2] https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
> > > > > > > > > > > > [3] https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala
> > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hello, Stuart.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Can you do some research and find out how schema is handled in Data Frames
> > > > > > > > > > > > > for a regular RDBMS such as Oracle, MySQL, etc?
> > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:
> > > > > > > > > > > > > > Stuart, Nikolay,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I see that the 'Table' class (returned by listTables method) has a
> > > > > > > > > > > > > > 'database' field. Can we use this one to report schema name?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In any case, I think we should look into how this is done in data source
> > > > > > > > > > > > > > implementations for other databases. Any relational database has a notion
> > > > > > > > > > > > > > of schema, and I'm sure Spark integrations take this into account somehow.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Val
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <[hidden email]> wrote:
> > > > > > > > > > > > > > > Hello, Stuart.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Personally, I think we should change current tables naming and return
> > > > > > > > > > > > > > > table in form of `schema.table`.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Valentin, could you share your opinion?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:
> > > > > > > > > > > > > > > > Igniters,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > While reviewing the changes for IGNITE-9228 [1,2], Nikolay and I are
> > > > > > > > > > > > > > > > discussing whether to introduce a change which may impact backwards
> > > > > > > > > > > > > > > > compatibility; Nikolay suggested we take the discussion to this list.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ignite implements a custom Spark catalog which provides an API by which
> > > > > > > > > > > > > > > > Spark users can list the tables which are available in Ignite which can be
> > > > > > > > > > > > > > > > queried via Spark SQL. Currently that table name list includes just the
> > > > > > > > > > > > > > > > names of the tables, but IGNITE-9228 is introducing a change which allows
> > > > > > > > > > > > > > > > optional prefixing of schema names to table names to disambiguate multiple
> > > > > > > > > > > > > > > > tables with the same name in different schemas. For the "list tables" API
> > > > > > > > > > > > > > > > we therefore have two options:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. List the tables using both their table names and schema-qualified table
> > > > > > > > > > > > > > > > names (e.g. [ "myTable", "mySchema.myTable" ]) even though they are the same
> > > > > > > > > > > > > > > > underlying table. This retains backwards compatibility with users who
> > > > > > > > > > > > > > > > expect "myTable" to appear in the catalog.
> > > > > > > > > > > > > > > > 2. List the tables using only their schema-qualified names. This eliminates
> > > > > > > > > > > > > > > > duplication of names in the catalog but will potentially break
> > > > > > > > > > > > > > > > compatibility with users who expect the table name in the catalog.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > With either option we will allow for Spark SQL SELECT statements to use
> > > > > > > > > > > > > > > > either table name or schema-qualified table names, this change would purely
> > > > > > > > > > > > > > > > impact the API which is used to list available tables.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Any opinions would be welcome.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Stuart.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9228
> > > > > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4551

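As an illustration of the two "list tables" options described in the original post quoted above, here is a minimal sketch in plain Python. It is a toy model only, not the Ignite or Spark API: the IGNITE_TABLES mapping and the functions list_both and list_qualified_only are hypothetical names made up for this example.

```python
from typing import Dict, List

# Hypothetical stand-in for Ignite metadata: schema name -> table names.
IGNITE_TABLES: Dict[str, List[str]] = {
    "PUBLIC": ["person"],
    "mySchema": ["myTable", "person"],
}


def list_both(tables: Dict[str, List[str]]) -> List[str]:
    """Option 1: list each table under its bare name and its qualified name."""
    names: List[str] = []
    for schema, table_names in tables.items():
        for table in table_names:
            # A bare name can only be listed once, even if two schemas share it.
            if table not in names:
                names.append(table)
            names.append(f"{schema}.{table}")
    return names


def list_qualified_only(tables: Dict[str, List[str]]) -> List[str]:
    """Option 2: list each table under its schema-qualified name only."""
    return [
        f"{schema}.{table}"
        for schema, table_names in tables.items()
        for table in table_names
    ]


if __name__ == "__main__":
    print(list_both(IGNITE_TABLES))
    print(list_qualified_only(IGNITE_TABLES))
```

With two schemas that both contain a table called person, option 1 still lists the bare name only once, so the bare entry alone does not say which schema it refers to; option 2 avoids that at the cost of dropping the bare names existing users may expect.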

Re: Table Names in Spark Catalog

Nikolay Izhikov-2
Hello, Stuart.

Do you need any assistance with this task from me or another community member?

On Tue, 04/09/2018 at 19:03 +0300, Nikolay Izhikov wrote:

> Hello, Stuart.
>
> Sorry for the silence.
>
> I was swamped the last couple of days.
>
> I think you can go ahead and implement the suggested solution.
> I'm -0 on it.
> So no block from my side, but I'm still not happy with the abstractions :).
>
> On Mon, 03/09/2018 at 09:35 +0100, Stuart Macdonald wrote:
> > Nikolay, Val, it would be good if we could reach agreement here so that I
> > can make the necessary modifications before the 2.7 cutoff.
> >
> > Nikolay - would you be comfortable if I went ahead and made database=schema?
> >
> > Stuart.
> >
> > On Mon, Aug 27, 2018 at 10:22 PM Valentin Kulichenko <
> > [hidden email]> wrote:
> >
> > > Hi Nikolay,
> > >
> > > I think it's actually pretty unfortunate that Spark uses the term "database"
> > > here, as it essentially refers to a schema in my view. Usually, a database is
> > > something you create a physical connection to, and the connection is bound to
> > > that database. To connect to another database you need to create a new
> > > connection. In Spark, however, you can switch between "databases" within a
> > > single session, which looks really weird to me because it's usually a
> > > characteristic of a schema. Having said that, I understand your concern,
> > > but I don't think there is an ideal solution.
> > >
> > > As for your approach, I still don't understand how it would allow us to fully
> > > support schemas in the catalog.
> > > - How will you get a list of tables within a particular schema? In other
> > > words, what would the listTables() method return?
> > > - How will you switch between the schemas?
> > > - Etc.
> > >
> > > I still think assuming database=schema is the best we can do here, but I
> > > would be happy to hear opinions from other community members.
> > >
> > > OPTION_SCHEMA should definitely be introduced though (I thought we already
> > > did, no?). CREATE TABLE will be supported with this ticket:
> > > https://issues.apache.org/jira/browse/IGNITE-5780. For now we will have to
> > > throw an exception if a custom schema name is provided when creating a Spark
> > > session, but the table does not exist yet.
> > >
> > > -Val
> > >
> > > On Sun, Aug 26, 2018 at 7:56 AM Nikolay Izhikov <[hidden email]>
> > > wrote:
> > >
> > > > Igniters,
> > > >
> > > > Personally, I don't like the solution with database == schema name.
> > > >
> > > > 1. I think we should try to use the right abstractions.
> > > > schema == database doesn't sound right to me.
> > > >
> > > > Do you want to answer all of our users with something like this:
> > > >
> > > > - "How can I change the Ignite SQL schema?"
> > > > - "This is obvious, just use setDatabase("MY_SCHEMA_NAME")".
> > > >
> > > > 2. I think we restrict the whole solution with that decision.
> > > > If Ignite supports multiple databases in the future, we just won't have
> > > > a place for them.
> > > >
> > > > I think we should do the following:
> > > >
> > > >         1. IgniteExternalCatalog should be able to return *ALL* tables
> > > >         within an Ignite instance.
> > > >         We shouldn't restrict the table list by schema by default.
> > > >         We should return tables with the schema name - `schema.table`
> > > >
> > > >         2. We should introduce `OPTION_SCHEMA` for a dataframe to specify
> > > >         a schema.
> > > >
> > > >         There is an issue with the second step: we can't use a schema name
> > > >         in a `CREATE TABLE` clause.
> > > >         This is a restriction of current Ignite SQL.
> > > >
> > > >         I propose the following:
> > > >
> > > >         1. For all write modes that require the creation of a table we
> > > >         should disallow usage of a table outside of `SQL_PUBLIC`
> > > >         or usage of `OPTION_SCHEMA`. We should throw a proper exception in
> > > >         this case.
> > > >
> > > >         2. Create a ticket to support `CREATE TABLE` with a custom schema
> > > >         name.
> > > >
> > > >         3. After resolving the ticket from step 2 we can add full support of
> > > >         custom schemas to the Spark integration.
> > > >
> > > >         4. We should throw an exception if the user tries to use setDatabase.
> > > >
> > > > Does that make sense to you?
> > > >
> > > > On Sun, 26/08/2018 at 14:09 +0100, Stuart Macdonald wrote:
> > > > > I'll go ahead and make the changes to represent the schema name as the
> > > > > database name for the purposes of the Spark catalog.
> > > > >
> > > > > If anyone knows of an existing way to list all available schemata within
> > > > > an Ignite instance please let me know, otherwise the first task will be
> > > > > creating that mechanism.
> > > > >
> > > > > Stuart.
> > > > >
> > > > > On Fri, Aug 24, 2018 at 6:23 PM Valentin Kulichenko <
> > > > > [hidden email]> wrote:
> > > > >
> > > > > > Nikolay,
> > > > > >
> > > > > > > If there are multiple configurations in XML, IgniteContext will always use
> > > > > > > only one of them. Looks like the current approach simply doesn't work. I
> > > > > > > propose to report the schema name as 'database' in Spark. If there are multiple
> > > > > > > clients, you would create multiple sessions and multiple catalogs.
> > > > > >
> > > > > > Makes sense?
> > > > > >
> > > > > > -Val
> > > > > >
> > > > > > > On Fri, Aug 24, 2018 at 12:33 AM Nikolay Izhikov <[hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hello, Valentin.
> > > > > > > >
> > > > > > > > > catalog exist in scope of a single IgniteSparkSession (and therefore
> > > > > > > > > single IgniteContext and single Ignite instance)?
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > > Actually, I was thinking about the use case where we have several Ignite
> > > > > > > > configurations in one XML file.
> > > > > > > > Now I see that this may be too rare a use case to support.
> > > > > > > >
> > > > > > > > Stuart, Valentin, what is your proposal?
> > > > > > >
> > > > > > > > On Wed, 22/08/2018 at 08:56 -0700, Valentin Kulichenko wrote:
> > > > > > > > > Nikolay,
> > > > > > > > >
> > > > > > > > > Whatever we decide on would be right :) Basically, we need to answer this
> > > > > > > > > question: does the catalog exist in scope of a single IgniteSparkSession
> > > > > > > > > (and therefore single IgniteContext and single Ignite instance)? In other
> > > > > > > > > words, in case of a rare use case when a single Spark application connects
> > > > > > > > > to multiple Ignite clusters, would there be a catalog created per cluster?
> > > > > > > > >
> > > > > > > > > If the answer is yes, current logic doesn't make sense.
> > > > > > > >
> > > > > > > > -Val
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Wed, Aug 22, 2018 at 1:44 AM Nikolay Izhikov <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > > Hello, Valentin.
> > > > > > > > > >
> > > > > > > > > > > I believe we should get rid of this logic and use Ignite schema name as
> > > > > > > > > > > database name in Spark's catalog.
> > > > > > > > > >
> > > > > > > > > > When I developed the Ignite integration with Spark Data Frame I used the
> > > > > > > > > > following abstraction described by Vladimir Ozerov:
> > > > > > > > > >
> > > > > > > > > > "1) Let's consider Ignite cluster as a single database ("catalog" in ANSI
> > > > > > > > > > SQL'92 terms)." [1]
> > > > > > > > > >
> > > > > > > > > > Was I wrong? If yes - let's fix it.
> > > > > > > > > >
> > > > > > > > > > [1] http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html
> > > > > > > > >
> > > > > > > > > > On Wed, 22/08/2018 at 09:26 +0100, Stuart Macdonald wrote:
> > > > > > > > > > > Hi Val, yes that's correct. I'd be happy to make the change to have the
> > > > > > > > > > > database reference the schema if Nikolay agrees. (I'll first need to do a
> > > > > > > > > > > bit of research into how to obtain the list of all available schemata...)
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Stuart.
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <
> > > > > > > > > > [hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Stuart,
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for pointing this out, I was not aware that we use Spark database
> > > > > > > > > > > > concept this way. Actually, this confuses me a lot. As far as I understand,
> > > > > > > > > > > > catalog is created in the scope of a particular IgniteSparkSession, which
> > > > > > > > > > > > in turn is assigned to a particular IgniteContext and therefore single
> > > > > > > > > > > > Ignite client. If that's the case, I don't think it should be aware of
> > > > > > > > > > > > other Ignite clients that are connected to other clusters. This doesn't
> > > > > > > > > > > > look like correct behavior to me, not to mention that with this approach
> > > > > > > > > > > > having multiple databases would be a very rare case. I believe we should
> > > > > > > > > > > > get rid of this logic and use Ignite schema name as database name in
> > > > > > > > > > > > Spark's catalog.
> > > > > > > > > > >
> > > > > > > > > > > Nikolay, what do you think?
> > > > > > > > > > >
> > > > > > > > > > > -Val
> > > > > > > > > > >

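To make the two catalog shapes debated in the quoted thread concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not the Spark or Ignite API: the Table dataclass only mimics the idea of a table entry with a 'database' field, and SCHEMAS, list_databases, list_tables and list_all_qualified are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Table:
    """Toy catalog entry: a table name plus the 'database' it is reported under."""
    name: str
    database: str


# Hypothetical Ignite metadata for one instance: schema name -> table names.
SCHEMAS: Dict[str, List[str]] = {
    "PUBLIC": ["city"],
    "MY_SCHEMA": ["city", "person"],
}


def list_databases(schemas: Dict[str, List[str]]) -> List[str]:
    """database == schema: every Ignite schema is reported as a database."""
    return sorted(schemas)


def list_tables(schemas: Dict[str, List[str]], database: str) -> List[Table]:
    """database == schema: tables of one schema, tagged with that schema."""
    if database not in schemas:
        raise KeyError(f"unknown database (schema): {database}")
    return [Table(name=table, database=database) for table in schemas[database]]


def list_all_qualified(schemas: Dict[str, List[str]]) -> List[str]:
    """Alternative proposal: one flat list of `schema.table` names."""
    return [f"{schema}.{table}" for schema in sorted(schemas) for table in schemas[schema]]


if __name__ == "__main__":
    print(list_databases(SCHEMAS))
    print(list_tables(SCHEMAS, "MY_SCHEMA"))
    print(list_all_qualified(SCHEMAS))
```

In the first shape a user picks a schema by choosing a "database" and then sees bare table names; in the second there is a single flat namespace and the schema travels inside the table name.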