[ML][DISCUSSION] Big Double problem

Ravil Galeyev
Hi Team,

I tried to run Ignite ML on a dataset with categorical features and
ran into some problems.

My dataset is the Mushrooms
<https://www.kaggle.com/uciml/mushroom-classification> dataset from Kaggle.
It has only categorical features and categorical labels (a so-called
classification problem). You can find my attempt in my repo
<https://github.com/dehasi/mushrooms/blob/master/src/main/java/me/dehasi/mushrooms/MushroomsMain.java>.

My goal is to build a pipeline that takes raw string values, encodes them
as numbers, and then trains a model.
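To illustrate the encoding step of such a pipeline, here is a minimal sketch in plain Java. It does not use any Ignite API: the class CategoricalEncoder, its methods, and the sample values are hypothetical and only show the idea of giving every distinct string in a column a numeric code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Ignite ML: index-encodes categorical string columns. */
final class CategoricalEncoder {
    /** One dictionary per column: category string -> numeric code. */
    private final List<Map<String, Double>> dictionaries = new ArrayList<>();

    /** Learns the dictionaries; codes are assigned per column in order of first appearance. */
    void fit(List<String[]> rows) {
        dictionaries.clear();
        for (String[] row : rows) {
            while (dictionaries.size() < row.length)
                dictionaries.add(new LinkedHashMap<>());
            for (int col = 0; col < row.length; col++) {
                Map<String, Double> dict = dictionaries.get(col);
                dict.putIfAbsent(row[col], (double) dict.size());
            }
        }
    }

    /** Encodes one raw row; fails on categories that were not seen during fit. */
    double[] transform(String[] row) {
        double[] encoded = new double[row.length];
        for (int col = 0; col < row.length; col++) {
            Double code = col < dictionaries.size() ? dictionaries.get(col).get(row[col]) : null;
            if (code == null)
                throw new IllegalArgumentException(
                    "Unknown category '" + row[col] + "' in column " + col);
            encoded[col] = code;
        }
        return encoded;
    }

    public static void main(String[] args) {
        // Made-up single-letter categories, in the style of the Mushrooms data.
        List<String[]> rows = Arrays.asList(
            new String[] {"x", "n"},
            new String[] {"b", "n"},
            new String[] {"x", "w"});
        CategoricalEncoder encoder = new CategoricalEncoder();
        encoder.fit(rows);
        System.out.println(Arrays.toString(encoder.transform(new String[] {"b", "w"})));
    }
}
```

The resulting double[] rows are what a numeric-only trainer could consume. An index encoding like this imposes an artificial order on the categories, so a one-hot encoding may be a better fit for some models.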

The first problem is the Vectorizer.

I started with DummyVectorizer, but it supports only Double labels.

All other vectorizers have the same issue because all of them inherit

from DefaultLabelVectorizer
<https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/dataset/feature/extractor/ExtractionUtils.java#L36>,
where the Double label type is hardcoded in the generic parameters.

I didn’t find a way to work with purely categorical data using the standard
Ignite vectorizers, so I wrote my own.

The second problem is EncoderTrainer (in my case, STRING_ENCODER).

It doesn’t encode labels; the trainer simply ignores them. See
EncoderTrainer
<https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L169>.

Probably ignoring labels makes sense, but…
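For comparison, a label encoder can be very small. The sketch below is plain Java and not Ignite API; LabelIndexer and its methods are hypothetical names, and the label values "p" and "e" are assumed for illustration. It maps each distinct label to 0.0, 1.0, ... and keeps the reverse mapping so predictions can be turned back into the original strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Ignite ML: maps class labels to 0.0, 1.0, ... and back. */
final class LabelIndexer {
    private final Map<String, Double> toIndex = new HashMap<>();
    private final List<String> toLabel = new ArrayList<>();

    /** Returns the numeric code of a label, assigning the next free code on first sight. */
    double encode(String label) {
        Double idx = toIndex.get(label);
        if (idx == null) {
            idx = (double) toLabel.size();
            toIndex.put(label, idx);
            toLabel.add(label);
        }
        return idx;
    }

    /** Returns the original label for a code produced by encode. */
    String decode(double index) {
        return toLabel.get((int) index);
    }

    public static void main(String[] args) {
        LabelIndexer indexer = new LabelIndexer();
        // Assumed label values for illustration: "p" and "e".
        for (String label : new String[] {"p", "e", "e", "p"})
            System.out.println(label + " -> " + indexer.encode(label));
        System.out.println("1.0 -> " + indexer.decode(1.0));
    }
}
```

Keeping the reverse mapping next to the model matters, because a trainer that only sees doubles will also predict doubles.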

The third problem is ClassCastException.

Model trainers contain casts of labels to Double that are hidden from the user,

e.g. in SVMLinearClassificationTrainer
<https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearClassificationTrainer.java#L191>,
DiscreteNaiveBayesTrainer, etc.
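The effect can be reproduced without Ignite. In the sketch below, LabeledRow and labelAsDouble are hypothetical stand-ins (not Ignite classes) for a generic labeled row and for trainer code that accepts any label type but casts it to Double internally. The code compiles for String labels and fails only at runtime.

```java
/** Hypothetical demo, not Ignite code: a generic label type combined with a hidden cast. */
final class HiddenCastDemo {
    /** Stand-in for a labeled training row with a generic label type. */
    static final class LabeledRow<L> {
        private final double[] features;
        private final L label;

        LabeledRow(double[] features, L label) {
            this.features = features;
            this.label = label;
        }

        double[] features() {
            return features;
        }

        L label() {
            return label;
        }
    }

    /** Mimics a trainer that accepts any label type but casts it to Double internally. */
    static double labelAsDouble(LabeledRow<?> row) {
        return (Double) row.label(); // compiles, but throws for non-Double labels
    }

    public static void main(String[] args) {
        System.out.println(labelAsDouble(new LabeledRow<>(new double[] {1.0}, 1.0)));
        try {
            labelAsDouble(new LabeledRow<>(new double[] {1.0}, "p"));
        }
        catch (ClassCastException e) {
            System.out.println("String label rejected only at runtime: " + e.getMessage());
        }
    }
}
```

The generic parameter suggests that any label type is allowed, which is why the failure is surprising for the user.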

Feel free to use my regex \(Double\).*\.label\(\) to search for other casts.
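As a sketch of how the regex behaves, the snippet below applies it to a few source lines with java.util.regex. The class DoubleCastFinder and the sample lines are made up for illustration and are not copied from the Ignite code base; note that the backslashes have to be doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: finds lines that cast the result of label() to Double. */
final class DoubleCastFinder {
    /** The regex from the post, written as a Java string literal. */
    static final Pattern LABEL_CAST = Pattern.compile("\\(Double\\).*\\.label\\(\\)");

    /** Returns the lines that contain a match anywhere in the line. */
    static List<String> findCasts(List<String> sourceLines) {
        List<String> hits = new ArrayList<>();
        for (String line : sourceLines) {
            if (LABEL_CAST.matcher(line).find())
                hits.add(line);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up lines for illustration only.
        List<String> lines = Arrays.asList(
            "Double lbl = (Double) row.label();",
            "double w = weights.get(i);",
            "Object lbl = row.label();");
        // Only the first line contains both the cast and the label() call.
        System.out.println(findCasts(lines));
    }
}
```

Because of the greedy .* in the middle, the pattern also matches lines where the cast and the label() call are unrelated, so the hits still need a manual look.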

To sum up, the code assumes that labels are numeric values,

but in a classification problem labels can be anything,

and I didn’t find an easy way to preprocess them.



If you have any questions or need details, feel free to write to me.

Best regards,

Ravil

Re: [ML][DISCUSSION] Big Double problem

Alexey Zinoviev
I agree that we should discuss this more widely here.

1) Can a label be a non-double value (a String, for example)?
2) Should we extend encoding to non-Double labels (if we support
non-double values)?
3) Should we validate and reject non-double values at the trainer level? (I
agree that a lot of double casting is ugly)
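To make question 3 concrete, here is a minimal sketch of what validation at the trainer level could look like. It is plain Java, not existing Ignite code; LabelValidation and requireDoubleLabel are hypothetical names. The idea is only to replace a bare cast with a check that fails early and explains what to do.

```java
/** Hypothetical helper, not Ignite code: fail-fast check for label types. */
final class LabelValidation {
    /** Returns the label as a double or throws with a message that explains the fix. */
    static double requireDoubleLabel(Object label) {
        if (!(label instanceof Double))
            throw new IllegalArgumentException("Label must be a Double, but got "
                + (label == null ? "null" : label.getClass().getName())
                + "; encode labels to numbers before training");
        return (Double) label;
    }

    public static void main(String[] args) {
        System.out.println(requireDoubleLabel(1.0));
        try {
            requireDoubleLabel("p");
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Compared with a plain cast, the user gets an IllegalArgumentException that names the offending type instead of a ClassCastException from deep inside a trainer.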

From my point of view, we should look at how scikit-learn and Spark ML
handle this issue, and then either
1) support all label types and fix the problems Ravil described above,
or
2) remove the strange generics, hard-code double labels without any
casting, and state our position in the documentation.

I agree that the first approach would take a lot of time.



Tue, 11 Jun 2019 at 00:29, Ravil Galeyev <[hidden email]>:
