Handling slashes in cache names

RE: Handling slashes in cache names

Stanislav Lukyanov
Hi all,

I’ve implemented an approach of encoding unsafe characters in the cache names for persistent storage directories. You can find it at https://github.com/gridgain/apache-ignite/tree/ignite-7264.
How it works now is: 1) all characters outside of the [a-zA-Z0-9_-] class are replaced with their hex value (seems to be the easiest way); 2) a hash of the cache name is added at the end of the name to avoid case-insensitive collisions.
There is still a tiny chance of hitting two cache names that are equal ignoring case which also have the same hash, but that’s really unlikely.

It seems that there are no complications with this approach.
The cache name to directory mapping is like
  mycache -> cache-mycache-f19fd83d
  my/cool/cache -> cache-my2fcool2fcache
  my!@#$%^&()cache -> cache-my21402324255e262829cache-84ba3e99
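
A minimal sketch of the two steps described above, for illustration only. It is not the code from the linked branch: the class and method names are hypothetical, and `String.hashCode()` is an assumed stand-in for whatever hash the branch uses, so the suffixes it prints will not necessarily match the ones listed above.

```java
final class HexEncodedDirName {
    /** Characters kept as-is: the [a-zA-Z0-9_-] class named in the message. */
    static boolean isSafe(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9') || cp == '_' || cp == '-';
    }

    /** Maps a cache name to a directory name: "cache-" + encoded name + "-" + hash. */
    static String toDirName(String cacheName) {
        StringBuilder sb = new StringBuilder("cache-");
        cacheName.codePoints().forEach(cp -> {
            if (isSafe(cp))
                sb.appendCodePoint(cp);
            else
                sb.append(String.format("%02x", cp)); // e.g. '/' becomes "2f"
        });
        // Assumption: String.hashCode() of the original name, printed as 8 hex digits.
        // Names differing only in case get different suffixes unless the hashes collide.
        return sb.append('-').append(String.format("%08x", cacheName.hashCode())).toString();
    }

    public static void main(String[] args) {
        System.out.println(toDirName("my/cool/cache"));
        System.out.println(toDirName("mycache"));
        System.out.println(toDirName("MyCache"));
    }
}
```

Because the hash is computed from the original (case-sensitive) name, `mycache` and `MyCache` map to directory names that still differ on a case-insensitive file system.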

Turns out the persistence is not the only place that doesn’t like special symbols in cache names – I also got an exception from MBean registration when creating a cache with ‘*’ or ‘?’. Filed https://issues.apache.org/jira/browse/IGNITE-7334 for that.

Please let me know if you have any comments.

Thanks,
Stan

From: Stanislav Lukyanov
Sent: December 25, 2017, 18:09
To: [hidden email]
Subject: Handling slashes in cache names

Hi all,

I’m looking into https://issues.apache.org/jira/browse/IGNITE-7264, and I need some guidance on what’s the best way to approach it.

The problem is that cache names are not restricted, but if persistence is enabled the cache needs to have a corresponding directory on the file system (“cache-…”) which can’t be created if the cache name contains certain characters (or a reserved system name).

A straightforward approach would be to check if a cache name is allowed on the local system (e.g. via `Paths.get(name)`) and fail to create the cache if it isn’t, but I’m a bit concerned about the consistency of the behavior (the same cache name may be allowed on one system and not on another).
I think a better way would be to replace special characters (say, all non-alphanumeric characters) with underscores in file names (not changing the cache configuration). Would this be OK? Are there any risks I’m not considering?
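
For reference, the local check mentioned above could look roughly like the sketch below. The class and method names are hypothetical. `Paths.get` throws `InvalidPathException` when the string cannot be converted to a path on the current platform, which is exactly why the outcome can differ from node to node.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

final class LocalNameCheck {
    /** Returns true if "cache-" + name can be parsed as a path on this platform. */
    static boolean isAllowedLocally(String cacheName) {
        try {
            Paths.get("cache-" + cacheName);
            return true;
        } catch (InvalidPathException e) {
            // The set of rejected characters depends on the local file system provider.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowedLocally("mycache"));
        System.out.println(isAllowedLocally("my\u0000cache"));
        // Parses successfully, but as a nested path rather than a single directory name.
        System.out.println(isAllowedLocally("my/cool/cache"));
    }
}
```

Note that a name containing `/` is not rejected by this check: it is parsed as a path with several elements, so the check alone would not catch the case from IGNITE-7264.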

WDYT?

Thanks,
Stan

Re: Handling slashes in cache names

dsetrakyan
On Thu, Dec 28, 2017 at 9:22 AM, Stanislav Lukyanov <[hidden email]>
wrote:

> Hi all ,
>
> I’ve implemented an approach of encoding unsafe characters in the cache
> names for persistent storage directories. You can find it at
> https://github.com/gridgain/apache-ignite/tree/ignite-7264.
> How it works now is: 1) all characters outside of the [a-zA-Z0-9_-] class
> are replaced with their hex value (seems to be the easiest way);


I would surround such replacements with "_", e.g. "myCacheName_somesymbol_".


> 2) a hash of the cache name is added at the end of the name to avoid
> case-insensitive collisions.
> There is still a tiny chance of hitting two cache names that are equal
> ignoring case which also have the same hash, but that’s really unlikely.
>

Here I am confused. I think the cache names should be case insensitive at
all times. I seriously doubt enforcing this rule would cause problems. If
we enforce this rule at cache creation time, then we would not have to add
a hashcode at the end.


>
> It seems that there are no complications with this approach.
> The cache name to directory mapping is like
>   mycache -> cache-mycache-f19fd83d
>   my/cool/cache -> cache-my2fcool2fcache
>

As mentioned above, I would prefer "cache-my_2f_cool_2f_cache"


>   my!@#$%^&()cache -> cache-my21402324255e262829cache-84ba3e99
>
> Turns out the persistence is not the only place that doesn’t like special
> symbols in cache names – I also got an exception from MBean registration
> when creating a cache with ‘*’ or ‘?’. Filed https://issues.apache.org/
> jira/browse/IGNITE-7334 for that.
>
> Please let me know if you have any comments.
>
> Thanks,
> Stan

RE: Handling slashes in cache names

Stanislav Lukyanov
> I would surround such replacements with "_", e.g. "myCacheName_somesymbol_".
Looks nice, will do.

> Here I am confused. I think the cache names should be case insensitive at
> all times. I seriously doubt enforcing this rule would cause problems. If
> we enforce this rule at cache creation time, then we would not have to add
> a hashcode at the end.
I think I would still keep the hashcode. E.g. I’m now also truncating names longer than 255 chars, and the truncated names could be equal. There could be more edge cases, and adding an imprint of the identity might help to avoid them. The names are readable enough with the hashes, but scary enough for users not to mess with them manually – I guess that’s a good thing :)
Making cache names always case-insensitive sounds good, but I’d separate it to another JIRA issue (it has larger compatibility impact, it affects a different part of the code base, etc). Is it OK?
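
A rough sketch of the variant discussed here, with "_"-delimited escapes, truncation to 255 characters, and a hash suffix. All names are hypothetical, and `String.hashCode()` is only an assumed stand-in for the real hash.

```java
final class EscapedDirName {
    /** Assumed limit for a single file name; taken from the 255 mentioned above. */
    static final int MAX_LEN = 255;

    static boolean isSafe(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9') || cp == '_' || cp == '-';
    }

    static String toDirName(String cacheName) {
        StringBuilder sb = new StringBuilder("cache-");
        cacheName.codePoints().forEach(cp -> {
            if (isSafe(cp))
                sb.appendCodePoint(cp);
            else
                sb.append('_').append(Integer.toHexString(cp)).append('_'); // '/' becomes "_2f_"
        });
        // The suffix is derived from the full, untruncated name.
        String suffix = "-" + String.format("%08x", cacheName.hashCode());
        // Truncate the readable part so that the whole name fits into MAX_LEN.
        // (A real implementation would avoid cutting an escape sequence in half.)
        if (sb.length() > MAX_LEN - suffix.length())
            sb.setLength(MAX_LEN - suffix.length());
        return sb.append(suffix).toString();
    }

    public static void main(String[] args) {
        System.out.println(toDirName("my/cool/cache"));
    }
}
```

Two long names that share their first few hundred characters are truncated to the same prefix, so the hash suffix is the only thing that keeps their directory names apart.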

Thanks,
Stan

From: Dmitriy Setrakyan
Sent: December 28, 2017, 22:33
To: [hidden email]
Subject: Re: Handling slashes in cache names

Re: Handling slashes in cache names

dsetrakyan
On Fri, Dec 29, 2017 at 2:28 AM, Stanislav Lukyanov <[hidden email]>
wrote:

> > I would surround such replacements with "_", e.g.
> > "myCacheName_somesymbol_".
> Looks nice, will do.
>
> > Here I am confused. I think the cache names should be case insensitive at
> > all times. I seriously doubt enforcing this rule would cause problems. If
> > we enforce this rule at cache creation time, then we would not have to
> > add a hashcode at the end.
> I think I would still keep the hashcode. E.g. I’m now also truncating
> names longer than 255 chars, and the truncated names could be equal. There
> could be more edge cases, and adding an imprint of the identity might help
> to avoid them. The names are readable enough with the hashes, but scary
> enough for users not to mess with them manually – I guess that’s a good
> thing :)
> Making cache names always case-insensitive sounds good, but I’d separate it
> to another JIRA issue (it has larger compatibility impact, it affects a
> different part of the code base, etc). Is it OK?
>

Well, having to support multiple cache name formats going forward will be
difficult. I would rather we finalize on it right now.  My preference would
be to limit to 255 characters right now and make cache names case
insensitive. I doubt such change would affect many users, but it would
definitely make things cleaner.

Would be nice to hear what others in the community think. Vladimir O.,
Alexey G.?

D.

RE: Handling slashes in cache names

Stanislav Lukyanov
Let me return to this issue.

> Well, having to support multiple cache name formats going forward will be
> difficult.
I don’t think there is a question of multiple name formats.
Let’s just say that there are issues that can be solved on the base cache level (e.g. making cache names always case-insensitive)
and there are issues that have to be solved by the PDS (e.g. special and non-ASCII symbols that we don’t want to always ban from names).
I’m not suggesting to introduce anything to PDS that will afterwards be handled by the base cache code. We’ll just handle some issues
first, in PDS, and other issues will be handled separately.

> My preference would be to limit to 255 characters right now
That would be good, but it doesn’t really solve the issue with the length.
Since non-ASCII characters (and non-alphanumeric ASCII) are encoded, the actual length of a cache’s directory name
may be greater than the length of the cache name (and don’t forget the “cache-” prefix).
We could come up with a “really safe” limit, but it might be too small (around 80?), and that would be limiting the API based on a rather arbitrary
implementation detail.

Another reason why I like to have a hash in the file name is that we might run into problems with
two names, one of which is an escaped version of the other, like “my/cache” and “my_2f_cache”.
And I guess there can be more similar collisions that we just don’t think of right now. Having a hash in the name
just works as a (probabilistic) failsafe for that.
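
To make the collision and the length growth concrete, here is a small sketch. The escaping scheme (`_` + hex code point + `_`), the helper names, and the use of `String.hashCode()` are assumptions made for illustration, not the actual PDS code.

```java
final class EscapeCollisionDemo {
    /** Escapes everything outside [a-zA-Z0-9_-] as _<hex code point>_. */
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        name.codePoints().forEach(cp -> {
            boolean safe = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
                || (cp >= '0' && cp <= '9') || cp == '_' || cp == '-';
            if (safe)
                sb.appendCodePoint(cp);
            else
                sb.append('_').append(Integer.toHexString(cp)).append('_');
        });
        return sb.toString();
    }

    /** Same as escape(), plus a hash of the original name as a probabilistic tie-breaker. */
    static String withHash(String name) {
        return escape(name) + "-" + String.format("%08x", name.hashCode());
    }

    public static void main(String[] args) {
        // Both names escape to "my_2f_cache"; only the hash suffix tells them apart.
        System.out.println(escape("my/cache") + " vs " + escape("my_2f_cache"));
        System.out.println(withHash("my/cache") + " vs " + withHash("my_2f_cache"));
        // A single Cyrillic letter takes five characters once escaped.
        System.out.println(escape("\u0434").length());
    }
}
```

Under this particular scheme a 255-character name written entirely in Cyrillic grows to 1275 characters, which is why a limit on the cache name alone does not bound the directory name.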

Thanks,
Stan

From: Dmitriy Setrakyan
Sent: January 2, 2018, 16:40
To: [hidden email]
Subject: Re: Handling slashes in cache names

Re: Handling slashes in cache names

yzhdanov
Agree that cache names should be case insensitive - currently it seems that
we have issues on Windows OS.

As for allowed characters - why don't we try creating a directory on all
nodes (but calling toLower() prior to creation)? If creation succeeds
everywhere then the cache name is acceptable. New nodes should throw an
exception if folder creation is impossible.

I don't like escaping since it will not add any usability for, let's say,
Chinese or Russian names. For example, MySQL supports ASCII:
[0-9,a-z,A-Z$_] (basic Latin letters, digits 0-9, dollar, underscore) and
Extended: U+0080 .. U+FFFF [1]

I also would think over some intersection of allowed file name characters
in different file systems [2]

[1] https://dev.mysql.com/doc/refman/5.7/en/identifiers.html
[2] https://en.wikipedia.org/wiki/Filename

Yakov Zhdanov

Re: Handling slashes in cache names

Pavel Tupitsyn
> try creating a directory on all nodes
And then a new node appears with a different kind of file system..

Escaping removes all limitations and does not affect usability.

Pavel

On Mon, Jan 15, 2018 at 5:47 PM, Yakov Zhdanov <[hidden email]> wrote:

> Agree that cache names should be case insensitive - currently it seems that
> we have issues on Windows OS.
>
> As far as allowed characters - why don't we try creating a directory on all
> nodes (but calling toLower() prior to creation)? If creation succeeds
> everywhere then cache name is acceptable. New nodes should throw exception
> if folder creation is impossible.
>
> I don't like escaping since it will not add any usability for, let's say,
> Chinese or Russian names. For example, MySQL supports ASCII:
> [0-9,a-z,A-Z$_] (basic Latin letters, digits 0-9, dollar, underscore) and
> Extended: U+0080 .. U+FFFF [1]
>
> I also would think over some intersection of allowed file name characters
> in different file systems [2]
>
> [1] https://dev.mysql.com/doc/refman/5.7/en/identifiers.html
> [2] https://en.wikipedia.org/wiki/Filename
>
> Yakov Zhdanov
>

Re: Handling slashes in cache names

yzhdanov
>> And then a new node appears with a different kind of file system..
This is hardly possible. And I suggest not to

>> Escaping removes all limitations and does not affect usability.
Disagree. You will never ever relate smth like "fdee0456adcc" to
"мои_данные".

Guys, I just realized that we create folder for cache group. How about we
choose group ID for folder name and put text file cachegroup.info
containing group name to it?

--Yakov

Re: Handling slashes in cache names

Pavel Tupitsyn
> You will never ever relate smth like "fdee0456adcc" to "мои_данные".

As a user, why do I need to understand file names in Ignite work directory?

On Mon, Jan 15, 2018 at 6:22 PM, Yakov Zhdanov <[hidden email]> wrote:

> >> And then a new node appears with a different kind of file system..
> This is hardly possible. And I suggest not to
>
> >> Escaping removes all limitations and does not affect usability.
> Disagree. You will never ever relate smth like "fdee0456adcc" to
> "мои_данные".
>
> Guys, I just realized that we create folder for cache group. How about we
> choose group ID for folder name and put text file cachegroup.info
> containing group name to it?
>
> --Yakov
>

Re: Handling slashes in cache names

yzhdanov
To understand how much storage you need for cache group "X" and watch the
trends.

Anyway, folder named by ID and txt file inside should do the trick =)
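
A minimal sketch of that layout, assuming the group ID is an `int` and the folder is named by its hex form. Only the file name `cachegroup.info` comes from the proposal above; the class name, method names, and the directory naming are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class GroupDirSketch {
    static final String INFO_FILE = "cachegroup.info";

    /** Creates <workDir>/<hex group id>/ and stores the readable group name inside it. */
    static Path createGroupDir(Path workDir, int groupId, String groupName) {
        try {
            Path dir = Files.createDirectories(workDir.resolve(Integer.toHexString(groupId)));
            Files.write(dir.resolve(INFO_FILE), groupName.getBytes(StandardCharsets.UTF_8));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the group name back from the info file. */
    static String readGroupName(Path dir) {
        try {
            return new String(Files.readAllBytes(dir.resolve(INFO_FILE)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path work = Paths.get(System.getProperty("java.io.tmpdir"), "group-dir-sketch");
        // The Cyrillic example name from this thread, written with Unicode escapes.
        String name = "\u043c\u043e\u0438_\u0434\u0430\u043d\u043d\u044b\u0435";
        Path dir = createGroupDir(work, name.hashCode(), name);
        System.out.println(dir.getFileName() + " -> " + readGroupName(dir));
    }
}
```

The directory name never contains user-supplied characters, so it is valid on any file system, while the original name stays recoverable from the text file.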

--Yakov

Re: Handling slashes in cache names

dsetrakyan
In reply to this post by Pavel Tupitsyn
On Mon, Jan 15, 2018 at 7:11 AM, Pavel Tupitsyn <[hidden email]>
wrote:

> > try creating a directory on all nodes
> And then a new node appears with a different kind of file system..
>

If a new node cannot create an existing cache, it should not be allowed to
start.

Re: Handling slashes in cache names

dsetrakyan
In reply to this post by Pavel Tupitsyn
On Mon, Jan 15, 2018 at 7:31 AM, Pavel Tupitsyn <[hidden email]>
wrote:

> > You will never ever relate smth like "fdee0456adcc" to "мои_данные".
>
> As a user, why do I need to understand file names in Ignite work directory?
>

Because it is better to have an understandable and human readable directory
structure than not. Let's do it right.

Re: Handling slashes in cache names

Pavel Tupitsyn
>  folder named by ID and txt file inside should do the trick
Agree

On Tue, Jan 16, 2018 at 1:02 PM, Dmitriy Setrakyan <[hidden email]>
wrote:

> On Mon, Jan 15, 2018 at 7:31 AM, Pavel Tupitsyn <[hidden email]>
> wrote:
>
> > > You will never ever relate smth like "fdee0456adcc" to "мои_данные".
> >
> > As a user, why do I need to understand file names in Ignite work
> directory?
> >
>
> Because it is better to have an understandable and human readable directory
> structure than not. Let's do it right.
>

RE: Handling slashes in cache names

Stanislav Lukyanov
How about using both escaping and a text file with the name?
One can think of the escaped name as a kind of ID, which happens to be human-readable when the name is in ASCII,
and as unreadable as a UUID when the name contains non-ASCII characters.
This way we have all the readability in the common case (when the name is all English letters and digits),
and some limited readability (via looking into text files) when other alphabets are used.

Thanks,
Stan

From: Pavel Tupitsyn
Sent: January 16, 2018, 14:01
To: [hidden email]
Subject: Re: Handling slashes in cache names

Re: Handling slashes in cache names

yzhdanov
>> How about using both escaping and a text file with the name? One can
>> think of the escaped name as a kind of ID, which happens to be
>> human-readable when the name is in ASCII, and as unreadable as a UUID
>> when the name contains non-ASCII characters. This way we have all the
>> readability in the common case (when the name is all English letters and
>> digits), and some limited readability (via looking into text files) when
>> other alphabets are used.

Sounds good to me.

--Yakov