Hi Team,
I tried to run Ignite ML on a dataset with categorical features and came across some problems.

My dataset is the Mushrooms <https://www.kaggle.com/uciml/mushroom-classification> dataset from Kaggle. It contains only categorical features and categorical labels (a so-called classification problem). You can find my attempt in my repo <https://github.com/dehasi/mushrooms/blob/master/src/main/java/me/dehasi/mushrooms/MushroomsMain.java>.

My goal is to build a pipeline that takes raw string values, encodes them as numbers, and then trains a model.

The first problem is the Vectorizer. I started with DummyVectorizer, but it supports only Double labels. All other vectorizers have the same issue because they all inherit from DefaultLabelVectorizer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/dataset/feature/extractor/ExtractionUtils.java#L36>, where Double labels are hardcoded at the generic level. I didn't find a way to work with purely categorical data using the standard Ignite vectorizers, so I wrote my own.

The second problem is EncoderTrainer (in my case with STRING_ENCODER). It doesn't encode labels; the trainer just ignores them. See EncoderTrainer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L169>. Ignoring labels probably makes sense, but…

The third problem is a ClassCastException. Model trainers contain casts of labels to Double that are "hidden" from the user, e.g. SVMLinearClassificationTrainer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearClassificationTrainer.java#L191>, DiscreteNaiveBayesTrainer, etc. Feel free to use my regex \(Double\).*\.label\(\) to search for other casts.

To sum up, the code assumes that labels are numeric values, but in a classification problem labels can be anything, and I didn't find an easy way to preprocess them.
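To illustrate the kind of label preprocessing that is missing, here is a minimal sketch in plain Java of mapping string labels to doubles by hand before training. LabelIndexer, fit and encode are hypothetical names chosen for this example; they are not part of Ignite, and the sketch does not touch any Ignite API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not an Ignite class): maps categorical labels to double codes. */
final class LabelIndexer {
    /** Label -> code, in order of first appearance. */
    private final Map<String, Double> index = new LinkedHashMap<>();

    /** Assigns 0.0, 1.0, 2.0, ... to each distinct label the first time it is seen. */
    LabelIndexer fit(List<String> labels) {
        for (String label : labels)
            index.putIfAbsent(label, (double) index.size());
        return this;
    }

    /** Returns the code of a label seen during fit, or fails for an unknown label. */
    double encode(String label) {
        Double code = index.get(label);
        if (code == null)
            throw new IllegalArgumentException("Unknown label: " + label);
        return code;
    }

    public static void main(String[] args) {
        // The Mushrooms class column holds "e" (edible) or "p" (poisonous).
        LabelIndexer indexer = new LabelIndexer().fit(List.of("p", "e", "e", "p"));
        System.out.println(indexer.encode("p")); // prints 0.0
        System.out.println(indexer.encode("e")); // prints 1.0
    }
}
```

The encoded value could then be returned as the label by a custom vectorizer, so that the Double casts inside the trainers succeed. Whether a given trainer accepts these particular codes (for example 0.0 and 1.0 for a binary classifier) is an assumption to check against that trainer.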
If you have any questions or need details, feel free to write to me.

Best regards,
Ravil
I agree that we should discuss this here more widely.
1) Could a label be a non-double value (a String, for example)?
2) Should we extend encoding to non-Double labels (if we work with non-double values)?
3) Should we validate and reject non-double values at the trainer level? (I agree that a lot of Double casting is ugly.)

From my point of view, we should look at how scikit-learn and Spark ML handle these issues, and then either
1) support all types in labels and fix the things Ravil described above, or
2) remove the strange generics, hard-code the work with double without casting, and state our position in the documentation.

The first approach costs a lot of time, I agree.

On Tue, 11 Jun 2019 at 00:29, Ravil Galeyev <[hidden email]> wrote:
> Hi Team,
>
> I tried to run Ignite ML across the dataset with categorical features and
> came across some problems.
> [...]
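To make question 3) concrete, here is a minimal sketch of what a trainer-level check could look like, so that a non-double label fails with a descriptive message instead of a bare ClassCastException. LabelChecks and requireDouble are hypothetical names for illustration only; nothing like this is claimed to exist in Ignite.

```java
/** Hypothetical helper (not an Ignite class): validates a label before it is used as a double. */
final class LabelChecks {
    private LabelChecks() {
        // Static helper, no instances.
    }

    /** Returns the label as a double, or throws with a message naming the offending type. */
    static double requireDouble(Object label) {
        if (label instanceof Double)
            return (Double) label;

        String actual = label == null
            ? "null"
            : label.getClass().getSimpleName() + " '" + label + "'";

        throw new IllegalArgumentException(
            "Expected a Double label but got " + actual
                + "; encode labels to doubles before training.");
    }

    public static void main(String[] args) {
        System.out.println(requireDouble(1.0)); // prints 1.0

        try {
            requireDouble("p");
        } catch (IllegalArgumentException e) {
            // The message names the actual label type and value.
            System.out.println(e.getMessage());
        }
    }
}
```

A check like this only improves the error message. It does not answer whether labels should be allowed to be non-double in the first place.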