Dear community,
With this thread I'd like to open a discussion that leads to contributions in the subject area.

Ignite has indexing capabilities backed by different mechanisms, including Lucene.

Currently Lucene 7.5.0 is used (released last year). It is a widespread and mature technology that covers text search and beyond (e.g. spatial data indexing).

My goal is to *expose more Lucene functionality to Ignite indexing and query mechanisms for text data*.

It's quite a simple request at the current stage. It comes from our project's needs, but I believe it will be useful to many more people. Let's walk through the items and vote on or discuss Jira tickets for them.

1. [trivial] Use dataQuery.getPageSize() to limit the number of search response items inside GridLuceneIndex.query(). Currently it calls IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all scored matches are returned, which we do not need in most cases.

2. [simple] Add sorting. Then a more capable search call can be executed: *IndexSearcher.search(query, count, sort)*
Implementation steps:
a) Introduce a boolean *sortField* parameter in the *@QueryTextField* annotation. If *true*, the field will be indexed but not tokenized. Numeric types are preferred here.
b) Add a *sort* collection to the *TextQuery* constructor. It should define the desired sort fields used for querying.
c) Implement Lucene sort usage in GridLuceneIndex.query().

3. [moderate] Build complex queries with *TextQuery*, including term/query boosting.
*This section is for voting only, as it requires more detailed work. It should be extended if the community is interested in it.*

Looking forward to your comments!

BR,
Yuriy Shuliha
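To make items 1 and 2 above concrete, here is a minimal sketch of bounded top-N selection, which is the idea behind asking IndexSearcher.search(query, count, sort) for a fixed number of hits instead of Integer.MAX_VALUE. It uses only the JDK. The class TopNSketch and the method topN are hypothetical names invented for this illustration, and the sketch does not claim to reproduce how Lucene or GridLuceneIndex.query() work internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustration only: bounded top-N selection, not Ignite or Lucene code. */
final class TopNSketch {
    /** Returns at most {@code limit} items, best first according to {@code order}. */
    static <T> List<T> topN(Iterable<T> hits, int limit, Comparator<T> order) {
        if (limit <= 0)
            return new ArrayList<>();

        // Reversed order puts the worst kept item at the head, so it is cheap to evict.
        PriorityQueue<T> kept = new PriorityQueue<>(order.reversed());

        for (T hit : hits) {
            if (kept.size() < limit)
                kept.add(hit);
            else if (order.compare(hit, kept.peek()) < 0) {
                // The new hit beats the worst kept one: swap them.
                kept.poll();
                kept.add(hit);
            }
        }

        List<T> result = new ArrayList<>(kept);
        result.sort(order);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> scores = List.of(42, 7, 99, 15, 63);

        // Highest score first, keep only the best three: prints [99, 63, 42]
        System.out.println(topN(scores, 3, Comparator.reverseOrder()));
    }
}
```

With a limit of 3 the queue never holds more than three items, however many hits are scanned, which is the kind of saving the proposal is after.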
Hi Yuriy,
Unfortunately, there is a plan to discontinue TextQueries in Ignite [1]. The motivation is that text indexes are not persistent, not transactional, and can't be used together with SQL or inside SQL, and there is a lack of interest from the community side. You are welcome to take on these issues and make TextQueries great.

1. PageSize can't be used to limit the result set. Query results are returned from a data node to the client-side cursor page by page, and this parameter is designed to control the page size. The query is supposed to execute lazily on the server side, and the full result set is not expected to be loaded into memory on the server side at once, but by pages. Do you mean you found that Lucene loads the entire result set into memory before the first page is sent to the client?

I'd think a new parameter should be added to limit the result. The best solution is to use query language commands for this, e.g. "LIMIT/OFFSET" in SQL.

This task doesn't look trivial. A query is a distributed operation: the same user query is executed on the data nodes, and then the results from all nodes must be correctly merged before being returned via the client cursor. So LIMIT should be applied on every node and then again in the merge phase.

Also, this may be non-obvious, but limiting results makes no sense without sorting, as there is no guarantee that every next query run will return the same data, because of page reordering. Basically, the merge phase receives results from data nodes asynchronously, and messages from different nodes can't be ordered.

2.
a. A "tokenize" parameter name (for @QueryTextField) looks more expressive, doesn't it?
b, c. What about distributed queries? How will partial results from nodes be merged? Does Lucene allow configuring a comparator for data sorting? Which comparator should Ignite choose to sort the result in the merge phase?

3. For now the Lucene engine is not configurable at all. E.g. it is impossible to configure the Tokenizer. I'd think about possible ways to configure the engine first, and only then go further to discuss/implement complex features that may depend on the engine config.

(In reply to Yuriy Shuliga <[hidden email]>, Thu, Aug 29, 2019 at 8:17 PM.)

--
Best regards,
Andrey V. Mashenkov
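The point about applying LIMIT on every node and then again in the merge phase can be sketched as follows. This is a plain-JDK illustration under the assumption that every node sorts with the same comparator; LimitMergeSketch, localTop and merge are hypothetical names, not Ignite APIs, and the lists merely stand in for data-node responses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Illustration only: the lists stand in for responses from data nodes. */
final class LimitMergeSketch {
    /** Node side: sort the local matches and keep at most {@code limit} of them. */
    static <T> List<T> localTop(List<T> localMatches, int limit, Comparator<T> order) {
        return localMatches.stream()
            .sorted(order)
            .limit(Math.max(0, limit))
            .collect(Collectors.toList());
    }

    /** Merge side: combine the per-node results, sort again, apply the limit again. */
    static <T> List<T> merge(List<List<T>> nodeResults, int limit, Comparator<T> order) {
        List<T> all = new ArrayList<>();

        for (List<T> part : nodeResults)
            all.addAll(part);

        all.sort(order);

        int size = Math.max(0, Math.min(limit, all.size()));
        return new ArrayList<>(all.subList(0, size));
    }

    public static void main(String[] args) {
        Comparator<Integer> asc = Comparator.naturalOrder();

        List<Integer> nodeA = localTop(List.of(5, 1, 9), 2, asc); // [1, 5]
        List<Integer> nodeB = localTop(List.of(4, 8, 2), 2, asc); // [2, 4]

        // The result is the same whichever node answers first: prints [1, 2] twice.
        System.out.println(merge(List.of(nodeA, nodeB), 2, asc));
        System.out.println(merge(List.of(nodeB, nodeA), 2, asc));
    }
}
```

Because the merge step sorts again, the outcome does not depend on the order in which node responses arrive, as long as the comparator never treats two distinct entries as equal. Without a sort there is no such guarantee, which is the concern raised above.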
Yuriy,
If you are ready to take over the full-text search indexes, then please go ahead. The primary reasons why the community wants to discontinue them first (and, probably, resurrect them later) are the limitations listed by Andrey and the minimal support from the community end.

-
Denis

(In reply to Andrey Mashenkov <[hidden email]>, Thu, Aug 29, 2019 at 1:29 PM.)
Yuriy,
Note that one of the major blockers for text queries is [1], which makes Lucene indexes unusable with persistence and is the main reason for the discontinuation. It should probably be addressed first to make text queries a valid product feature.

Distributed sorting and advanced querying are indeed not trivial tasks. Some kind of merging must be implemented on the query-originating node.

[1] https://issues.apache.org/jira/browse/IGNITE-5371

(In reply to Denis Magda <[hidden email]>, Thu, Aug 29, 2019 at 23:38.)

--
Best regards,
Alexei Scherbakov
Hello to all again,
Thank you for the important comments and notes given below!

Let me answer and continue the discussion.

(I) Overall needs in Lucene indexing

Alexei referred to https://issues.apache.org/jira/browse/IGNITE-5371, where the absence of index persistence was declared an obstacle to further development.

a) This ticket is already closed as not valid.
b) There are definite needs (in our project as well) for just in-memory indexing of selected data. We intend to use search capabilities for fetching a limited number of records to be used in type-ahead search / suggestions. Not all of the data will be indexed, and there is no need for the Lucene index to be persistent. I hope this is a common pattern of text-search usage.

(II) Necessary fixes in current implementation

a) Implementation of a correct *limit* (*offset* seems not to be required in text-search tasks for now)
I have investigated the data flow for distributed text queries. It was a simple test prefix query, like 'name'*='ene*'*
For now each server node returns all response records to the client node, and the response may contain thousands or hundreds of thousands of records, even if we need only the first 10-100. Again, all the results are added to the queue in GridCacheQueryFutureAdapter in arbitrary order, by pages. I did not find any means here to deliver a deterministic result. So implementing the limit as part of the query (and of GridCacheQueryRequest) will not change the nature of the response, but it will limit the load on nodes and networking.

Can we consider opening a ticket for this?

(III) Further extension of Lucene API exposure to Ignite

a) Sorting
The solution for this could be:
- Make entities comparable
- Add a custom comparator to the entity
- Add annotations to mark sorted fields for Lucene indexing
- Use comparators when merging responses or reducing to the desired limit on the client node.
This will require the full result set to be loaded into memory, though it can be used for relatively small limits.

BR,
Yuriy Shuliha

(In reply to Alexei Scherbakov <[hidden email]>, Fri, Aug 30, 2019 at 10:37.)
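The difference between page size and the limit proposed in (II) a) can be illustrated with a small sketch. It assumes a node already holds its local matches in a list; PagedLimitSketch and pages are hypothetical names, and the real request/response flow through GridCacheQueryRequest and GridCacheQueryFutureAdapter is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: page size controls chunking, limit controls the total. */
final class PagedLimitSketch {
    /** Splits at most {@code limit} records into pages of at most {@code pageSize} records. */
    static <T> List<List<T>> pages(List<T> matches, int pageSize, int limit) {
        if (pageSize <= 0)
            throw new IllegalArgumentException("pageSize must be positive");

        // The limit caps how many records leave the node at all.
        int total = Math.max(0, Math.min(limit, matches.size()));

        List<List<T>> result = new ArrayList<>();

        // The page size only decides how those records are chunked for transfer.
        for (int from = 0; from < total; from += pageSize) {
            int to = Math.min(total, from + pageSize);
            result.add(new ArrayList<>(matches.subList(from, to)));
        }

        return result;
    }

    public static void main(String[] args) {
        List<Integer> matches = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);

        // Ten matches, pages of four, limit of six: prints [[0, 1, 2, 3], [4, 5]]
        System.out.println(pages(matches, 4, 6));
    }
}
```

Changing only the page size changes how many messages are sent, not how many records are sent. Only the limit reduces the total, which is why the two parameters are kept separate in this sketch.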
Igniters,
I see nothing wrong with Yury's proposal regarding the evolution of the full-text search API, as long as Yury is ready to push it forward.

As for the in-memory-only mode, it makes total sense for in-memory data grid deployments where Ignite caches data of an underlying DB like Postgres. As part of the changes, I would simply throw an exception (by default) if someone attempts to use text indices with native persistence enabled. If the person is ready to live with that limitation, then an explicit configuration change is needed to get around the exception.

Thoughts?

-
Denis

(In reply to Yuriy Shuliga <[hidden email]>, Tue, Sep 17, 2019 at 7:44 AM.)
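The default-deny check Denis describes could look roughly like the sketch below. All names in it (TextIndexGuardSketch, CacheSettings, allowNonPersistentTextIndex) are hypothetical and chosen for illustration only; they are not existing Ignite configuration properties.

```java
/** Illustration only: a default-deny check with an explicit opt-in. */
final class TextIndexGuardSketch {
    /** Hypothetical settings holder; none of these fields is a real Ignite property. */
    static final class CacheSettings {
        final boolean hasTextIndex;
        final boolean persistenceEnabled;
        final boolean allowNonPersistentTextIndex;

        CacheSettings(boolean hasTextIndex, boolean persistenceEnabled, boolean allowNonPersistentTextIndex) {
            this.hasTextIndex = hasTextIndex;
            this.persistenceEnabled = persistenceEnabled;
            this.allowNonPersistentTextIndex = allowNonPersistentTextIndex;
        }
    }

    /** Rejects text indexes on persistent caches unless the user opted in explicitly. */
    static void validate(CacheSettings s) {
        if (s.hasTextIndex && s.persistenceEnabled && !s.allowNonPersistentTextIndex)
            throw new IllegalStateException(
                "Text indexes are not persisted; set the opt-in flag to use them with persistence.");
    }

    public static void main(String[] args) {
        // Passes: the user accepted the limitation explicitly.
        validate(new CacheSettings(true, true, true));

        try {
            // Fails by default: text index plus persistence without the opt-in.
            validate(new CacheSettings(true, true, false));
        }
        catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Keeping the opt-in as a separate flag means in-memory-only deployments are unaffected, while persistent ones fail fast instead of silently losing the index after a restart.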
Denis,
I like the idea of throwing an exception when text queries are enabled on persistent caches.

I'm also fine with the proposed limit for unsorted searches.

Yury, please proceed with ticket creation.

(In reply to Denis Magda <[hidden email]>, Tue, Sep 17, 2019, 22:06.)
> > > > > > > > > > Also, this may be non-obviuos, limiting results make no sence > without > > > > > sorting, > > > > > as there is no guarantee every next query run will return same data > > > > because > > > > > of page reordeing. > > > > > Basically, merge phase receive results from data nodes > asynchronously > > > and > > > > > messages from different nodes can't be ordered. > > > > > > > > > > 2. > > > > > a. "tokenize" param name (for @QueryTextFiled) looks more verbose, > > > isn't > > > > > it. > > > > > b,c. What about distributed query? How partial results from nodes > > will > > > be > > > > > merged? > > > > > Does Lucene allows to configure comparator for data sorting? > > > > > What comparator Ignite should choose to sort result on merge phase? > > > > > > > > > > 3. For now Lucene engine is not configurable at all. E.g. it is > > > > impossible > > > > > to configure Tokenizer. > > > > > I'd think about possible ways to configure engine at first and only > > > then > > > > go > > > > > further to discuss\implement complex features, > > > > > that may depends on engine config. > > > > > > > > > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <[hidden email]> > > > wrote: > > > > > > > > > > > Dear community, > > > > > > > > > > > > By starting this chain I'd like to open discussion that would > come > > to > > > > > > contribution results in subj. area. > > > > > > > > > > > > Ignite has indexing capabilities, backed up by different > > mechanisms, > > > > > > including Lucene. > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (past year release). > > > > > > This is a wide spread and mature technology that covers text > search > > > > area > > > > > > and beyond (e.g. spacial data indexing). > > > > > > > > > > > > My goal is to *expose more Lucene functionality to Ignite > indexing > > > and > > > > > > query mechanisms for text data*. > > > > > > > > > > > > It's quite simple request at current stage. 
It is coming from our > > > > > project's > > > > > > needs, but i believe, will be useful for a lot more people. > > > > > > Let's walk through and vote or discuss about Jira tickets for > them. > > > > > > > > > > > > 1.[trivial] Use dataQuery.getPageSize() to limit search > response > > > > items > > > > > > inside GridLuceneIndex.query(). Currently it is calling > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*) - so basically > all > > > > > scored > > > > > > matches will me returned, what we do not need in most cases. > > > > > > > > > > > > 2.[simple] Add sorting. Then more capable search call can be > > > > > > executed: *IndexSearcher.search(query, count, > > > > > > sort) * > > > > > > Implementation steps: > > > > > > a) Introduce boolean *sortField* parameter in *@QueryTextFiled * > > > > > > annotation. If > > > > > > *true *the filed will be indexed but not tokenized. Number types > > are > > > > > > preferred here. > > > > > > b) Add *sort* collection to *TextQuery* constructor. It should > > define > > > > > > desired sort fields used for querying. > > > > > > c) Implement Lucene sort usage in GridLuceneIndex.query(). > > > > > > > > > > > > 3.[moderate] Build complex queries with *TextQuery*, including > > > > > > terms/queries boosting. > > > > > > *This section for voting only, as requires more detailed work. > > Should > > > > be > > > > > > extended if community is interested in it.* > > > > > > > > > > > > Looking forward to your comments! > > > > > > > > > > > > BR, > > > > > > Yuriy Shuliha > > > > > > > > > > > > > > > > > > > > > -- > > > > > Best regards, > > > > > Andrey V. Mashenkov > > > > > > > > > > > > > > > > > > -- > > > > > > Best regards, > > > Alexei Scherbakov > > > > > > |
Yuriy,
Greatly appreciate your interest.

Could you please elaborate a little on sorting? What tasks does it help to solve, and how? An example would be great.

-- Best regards, Ivan Pavlukhin |
Ivan,
Thank you for the interesting question!

Text searches (or full-text searches) are mostly human-oriented, and the user is mainly interested in the topmost part of the response. The user can then read it, evaluate it, and use the returned records for further purposes.

In our case in particular, we use Ignite for operations on financial data, which contains a lot of text such as asset names, financial instruments, companies, etc. To work with this data quickly and reliably, users rely on text search, type-ahead completion, and suggestions.

For this purpose we index particular string data in separate caches.

Sorting capabilities and response size limits are very important here, as our API has to provide the most relevant information within a limited size.

Now let me comment from the Ignite/Lucene perspective.
Ignite queries Lucene, and Lucene returns *TopDocs.scoreDocs* already sorted by *score* (relevance), so the most relevant documents are at the top.
Currently, the responses to a distributed query from different nodes are merged into the final query cursor queue in arbitrary order, so in fact the score order is already lost at this point. Also, Ignite requests all possible documents from Lucene, which is redundant and bad for performance.

I'm implementing a *limit* parameter as part of *TextQuery*, and I have to note that we still need to add sorting to text query processing in order to get usable results.

The *limit* parameter by itself should mitigate some of the issues above, but sorting by document score, at least, should definitely be implemented along with the limit.

This is a fairly short commentary; if you still have any questions, please do not hesitate to ask.

BR,
Yuriy Shuliha |
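The limit-plus-score-ordering idea from the message above can be sketched as a k-way merge on the query originating node. This is a minimal, library-free illustration and not Ignite or Lucene code: `ScoredHit`, `ScoreMerge`, and `mergeTopN` are hypothetical names, and the sketch assumes each data node has already sorted its hits by descending score and applied the same limit locally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical holder for one search hit: a cache key plus its relevance score. */
final class ScoredHit {
    final String key;
    final float score;

    ScoredHit(String key, float score) {
        this.key = key;
        this.score = score;
    }
}

/** Hypothetical helper that merges per-node responses into one top-N list. */
final class ScoreMerge {
    /** Cursor over one node's response, assumed to be sorted by descending score. */
    private static final class Cursor {
        final List<ScoredHit> hits;
        int pos;

        Cursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        ScoredHit head() {
            return hits.get(pos);
        }
    }

    /** Returns at most {@code limit} hits in descending score order across all nodes. */
    static List<ScoredHit> mergeTopN(List<List<ScoredHit>> perNode, int limit) {
        // The queue always exposes the cursor whose current head has the highest score.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
            (a, b) -> Float.compare(b.head().score, a.head().score));

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                queue.add(new Cursor(hits));
        }

        List<ScoredHit> merged = new ArrayList<>();

        while (merged.size() < limit && !queue.isEmpty()) {
            Cursor c = queue.poll();
            merged.add(c.head());
            c.pos++;

            // Re-insert the cursor only if this node still has hits left.
            if (c.pos < c.hits.size())
                queue.add(c);
        }

        return merged;
    }
}
```

Because every node sends at most `limit` hits, the originating node holds no more than `limit × nodes` entries in memory, which addresses the load and networking concern raised earlier in the thread. One caveat to keep in mind: Lucene computes scores from the statistics of each node's local index, so scores coming from different nodes may not be strictly comparable.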
Yuriy,
Thank you for the details, quite interesting.

Yes, we already support distributed limit and merging of sorted sub-results for SQL queries. For example, ReduceIndexSorted and MergeStreamIterator are used to merge sorted streams.

Could you also clarify score/relevance? Does the Lucene engine provide it for each query result? I am thinking about how to do a sorted merge properly in this case.

--
Best regards,
Ivan Pavlukhin |
Yuriy,
I see you have opened a pull request with the first changes: https://issues.apache.org/jira/browse/IGNITE-12189

Alex Scherbakov and Ivan, are you the right people to do the review?

-
Denis |
Ivan,
Regarding your question about the Lucene search response.

*IndexSearcher.search()* always returns results sorted, either by *score* (*relevance*) by default or by a supplied *Sort*, which defines the ordering fields and rules. This means that even now the *GridLuceneIndex* result is incorrect for distributed queries, because the per-node results are merged in arbitrary order.

Under the hood, a *ScoreDoc* object is used to fetch the desired document/record; this class carries the document id (*doc*) and the *score*. So a small wrapper implementing the Comparable interface may solve the merging of ordered results.

BR,
Yuriy Shuliha
It is > supposed > > > query > > > > > > > > executes > > > > > > > > > lazily on server side and > > > > > > > > > it is not excepted full resultset be loaded to memory on > server > > > > > side > > > > > > at > > > > > > > > > once, but by pages. > > > > > > > > > Do you mean you found Lucene load entire resultset into > memory > > > > > before > > > > > > > > first > > > > > > > > > page is sent to client? > > > > > > > > > > > > > > > > > > I'd think a new parameter should be added to limit result. > The > > > best > > > > > > > > > solution is to use query language commands for this, e.g. > > > > > > > "LIMIT/OFFSET" > > > > > > > > in > > > > > > > > > SQL. > > > > > > > > > > > > > > > > > > This task doesn't look trivial. Query is distributed > operation > > > and > > > > > > same > > > > > > > > > user query will be executed on data nodes > > > > > > > > > and then results from all nodes should be correcly merged > > > before > > > > > > being > > > > > > > > > returned via client-cursor. > > > > > > > > > So, LIMIT should be applied on every node and then on merge > > > phase. > > > > > > > > > > > > > > > > > > Also, this may be non-obviuos, limiting results make no > sence > > > > > without > > > > > > > > > sorting, > > > > > > > > > as there is no guarantee every next query run will return > same > > > data > > > > > > > > because > > > > > > > > > of page reordeing. > > > > > > > > > Basically, merge phase receive results from data nodes > > > > > asynchronously > > > > > > > and > > > > > > > > > messages from different nodes can't be ordered. > > > > > > > > > > > > > > > > > > 2. > > > > > > > > > a. "tokenize" param name (for @QueryTextFiled) looks more > > > verbose, > > > > > > > isn't > > > > > > > > > it. > > > > > > > > > b,c. What about distributed query? How partial results from > > > nodes > > > > > > will > > > > > > > be > > > > > > > > > merged? > > > > > > > > > Does Lucene allows to configure comparator for data > sorting? 
> > > > > > > > > What comparator Ignite should choose to sort result on > merge > > > phase? > > > > > > > > > > > > > > > > > > 3. For now Lucene engine is not configurable at all. E.g. > it is > > > > > > > > impossible > > > > > > > > > to configure Tokenizer. > > > > > > > > > I'd think about possible ways to configure engine at first > and > > > only > > > > > > > then > > > > > > > > go > > > > > > > > > further to discuss\implement complex features, > > > > > > > > > that may depends on engine config. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga < > > > [hidden email]> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Dear community, > > > > > > > > > > > > > > > > > > > > By starting this chain I'd like to open discussion that > would > > > > > come > > > > > > to > > > > > > > > > > contribution results in subj. area. > > > > > > > > > > > > > > > > > > > > Ignite has indexing capabilities, backed up by different > > > > > > mechanisms, > > > > > > > > > > including Lucene. > > > > > > > > > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (past year release). > > > > > > > > > > This is a wide spread and mature technology that covers > text > > > > > search > > > > > > > > area > > > > > > > > > > and beyond (e.g. spacial data indexing). > > > > > > > > > > > > > > > > > > > > My goal is to *expose more Lucene functionality to Ignite > > > > > indexing > > > > > > > and > > > > > > > > > > query mechanisms for text data*. > > > > > > > > > > > > > > > > > > > > It's quite simple request at current stage. It is coming > > > from our > > > > > > > > > project's > > > > > > > > > > needs, but i believe, will be useful for a lot more > people. > > > > > > > > > > Let's walk through and vote or discuss about Jira > tickets for > > > > > them. 
> > > > > > > > > > > > > > > > > > > > 1.[trivial] Use dataQuery.getPageSize() to limit search > > > > > response > > > > > > > > items > > > > > > > > > > inside GridLuceneIndex.query(). Currently it is calling > > > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*) - so > > > basically > > > > > all > > > > > > > > > scored > > > > > > > > > > matches will me returned, what we do not need in most > cases. > > > > > > > > > > > > > > > > > > > > 2.[simple] Add sorting. Then more capable search call > can be > > > > > > > > > > executed: *IndexSearcher.search(query, count, > > > > > > > > > > sort) * > > > > > > > > > > Implementation steps: > > > > > > > > > > a) Introduce boolean *sortField* parameter in > > > *@QueryTextFiled * > > > > > > > > > > annotation. If > > > > > > > > > > *true *the filed will be indexed but not tokenized. > Number > > > types > > > > > > are > > > > > > > > > > preferred here. > > > > > > > > > > b) Add *sort* collection to *TextQuery* constructor. It > > > should > > > > > > define > > > > > > > > > > desired sort fields used for querying. > > > > > > > > > > c) Implement Lucene sort usage in > GridLuceneIndex.query(). > > > > > > > > > > > > > > > > > > > > 3.[moderate] Build complex queries with *TextQuery*, > > > including > > > > > > > > > > terms/queries boosting. > > > > > > > > > > *This section for voting only, as requires more detailed > > > work. > > > > > > Should > > > > > > > > be > > > > > > > > > > extended if community is interested in it.* > > > > > > > > > > > > > > > > > > > > Looking forward to your comments! > > > > > > > > > > > > > > > > > > > > BR, > > > > > > > > > > Yuriy Shuliha > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Best regards, > > > > > > > > > Andrey V. 
Mashenkov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > Best regards, > > > > > > > Alexei Scherbakov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Best regards, > > > Ivan Pavlukhin > > > > > > > -- > Best regards, > Ivan Pavlukhin > |
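The fail-fast behaviour Denis proposes in the quoted message above (throw by default when text indices meet native persistence, unless the user explicitly opts in) could look roughly like the sketch below. It is plain Java and only an illustration: `TextIndexGuard`, `validate`, and the `allowInMemoryTextIndex` flag are hypothetical names, not existing Ignite APIs or configuration properties.

```java
/** Hypothetical illustration; none of these names exist in Ignite. */
final class TextIndexGuard {
    private TextIndexGuard() {}

    /**
     * Rejects a cache setup that combines a text index with native persistence,
     * unless the user has explicitly accepted that the index lives in memory only.
     */
    static void validate(String cacheName, boolean hasTextIndex,
                         boolean persistenceEnabled, boolean allowInMemoryTextIndex) {
        if (hasTextIndex && persistenceEnabled && !allowInMemoryTextIndex) {
            throw new IllegalStateException("Cache '" + cacheName
                + "' uses a text index, which is not persisted; "
                + "set the (hypothetical) allowInMemoryTextIndex flag to accept this.");
        }
    }

    public static void main(String[] args) {
        validate("suggestions", true, false, false); // in-memory cache: accepted
        validate("assets", true, true, true);        // persistence with explicit opt-in: accepted
        try {
            validate("assets", true, true, false);   // persistence without opt-in: rejected
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The check is deliberately a pure function of three booleans, so it could run at cache start before any index is built; where such a check would actually live in Ignite is not something the thread settles.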
Denis,
Thank you for your attention to this.
As of now, the https://issues.apache.org/jira/browse/IGNITE-12189 ticket is still pending review.
Is there a chance to move it forward somehow?

BR,
Yuriy Shuliha

Mon, 30 Sep 2019 at 23:35, Denis Magda <[hidden email]> wrote:

> Yuriy,
>
> I've seen you opening a pull request with the first changes:
> https://issues.apache.org/jira/browse/IGNITE-12189
>
> Alex Scherbakov and Ivan, are you the right guys to do the review?
>
> -
> Denis
>
> On Fri, Sep 27, 2019 at 8:48 AM Павлухин Иван <[hidden email]> wrote:
>
> > Yuriy,
> >
> > Thank you for providing details! Quite interesting.
> >
> > Yes, we already have support for distributed limit and for merging sorted subresults for SQL queries. E.g. ReduceIndexSorted and MergeStreamIterator are used for merging sorted streams.
> >
> > Could you please also clarify about score/relevance? Is it provided by the Lucene engine for each query result? I am thinking about how to do a sorted merge properly in this case.
> >
> > Wed, 25 Sep 2019 at 18:56, Yuriy Shuliga <[hidden email]> wrote:
> >
> > > Ivan,
> > >
> > > Thank you for the interesting question!
> > >
> > > Text searches (or full-text searches) are mostly human-oriented, and the user's interest lies in the topmost part of the response.
> > > The user can then read it, evaluate it, and use the given records for further purposes.
> > >
> > > In our case in particular, we use Ignite for operations with financial data, and there is a lot of text there: asset names, financial instruments, companies, etc.
> > > In order to operate on this quickly and reliably, users are used to working with text search, type-ahead completions, and suggestions.
> > >
> > > For this purpose we index particular string data in separate caches.
> > >
> > > Sorting capabilities and response size limits are very important there, as our API has to provide the most relevant information within a limited size.
> > >
> > > Now let me comment from the Ignite/Lucene perspective.
> > > Ignite queries Lucene, and Lucene returns *TopDocs.scoreDocs* already sorted by *score* (relevance), so the most relevant documents are on top.
> > > Currently, distributed query responses from different nodes are merged into the final query cursor queue in arbitrary order.
> > > So in fact the score order is already ruined here. Ignite also requests all possible documents from Lucene, which is redundant and bad for performance.
> > >
> > > I'm implementing a *limit* parameter as part of *TextQuery*, and I have to note that we still need to add sorting to text query processing in order to get usable results.
> > >
> > > The *limit* parameter by itself should ease some of the issues above, but sorting by document score, at the very least, should be implemented along with it.
> > >
> > > This is a fairly short commentary; if you still have any questions, please do not hesitate to ask.
> > >
> > > BR,
> > > Yuriy Shuliha
|
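Ivan's question about a sorted merge by relevance, together with Yuriy's point that a limit only makes sense with score ordering, can be illustrated with a k-way merge of per-node results. The sketch below is plain Java with no Ignite or Lucene dependency; `Hit`, `Cursor`, and `mergeTopN` are hypothetical names. It assumes each node has already applied the limit locally and returns its hits sorted by score in descending order, and it also assumes that scores from different nodes are directly comparable, which the thread does not establish.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; these types are not part of Ignite or Lucene. */
final class ScoredTopNMerge {
    /** A (key, score) pair standing in for one hit returned by a data node. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }

        @Override public String toString() {
            return key + ":" + score;
        }
    }

    /** Read position in one node's response (assumed sorted by score, descending). */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit current() {
            return hits.get(pos);
        }
    }

    /** K-way merge: repeatedly take the best remaining head until 'limit' hits are collected. */
    static List<Hit> mergeTopN(List<List<Hit>> perNode, int limit) {
        // Max-heap keyed by the score of each cursor's current hit.
        PriorityQueue<Cursor> heads = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.current().score).reversed());

        for (List<Hit> hits : perNode)
            if (!hits.isEmpty())
                heads.add(new Cursor(hits));

        List<Hit> merged = new ArrayList<>();
        while (merged.size() < limit && !heads.isEmpty()) {
            Cursor best = heads.poll();
            merged.add(best.current());
            if (++best.pos < best.hits.size())
                heads.add(best); // re-insert with its next hit as the new head
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Hit>> perNode = Arrays.asList(
            Arrays.asList(new Hit("a", 0.9f), new Hit("b", 0.4f)),
            Arrays.asList(new Hit("c", 0.7f), new Hit("d", 0.6f)));
        System.out.println(mergeTopN(perNode, 3));
    }
}
```

With a per-node limit of N, the reducer never holds more than N hits per node, which is the load reduction Yuriy describes; a streaming variant would pull pages lazily instead of taking complete lists.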
Yuri,
I'm done with the review. Nothing criminal found, just a trivial compatibility bug.

On Thu, Oct 3, 2019 at 3:54 PM Yuriy Shuliga <[hidden email]> wrote:

> Denis,
>
> Thank you for your attention to this.
> As of now, the https://issues.apache.org/jira/browse/IGNITE-12189 ticket is still pending review.
> Is there a chance to move it forward somehow?
>
> BR,
> Yuriy Shuliha
> > > > > > > > > > > Do you mean you found Lucene load entire resultset into > > > memory > > > > > > > before > > > > > > > > > > first > > > > > > > > > > > page is sent to client? > > > > > > > > > > > > > > > > > > > > > > I'd think a new parameter should be added to limit > > result. > > > The > > > > > best > > > > > > > > > > > solution is to use query language commands for this, > e.g. > > > > > > > > > "LIMIT/OFFSET" > > > > > > > > > > in > > > > > > > > > > > SQL. > > > > > > > > > > > > > > > > > > > > > > This task doesn't look trivial. Query is distributed > > > operation > > > > > and > > > > > > > > same > > > > > > > > > > > user query will be executed on data nodes > > > > > > > > > > > and then results from all nodes should be correcly > merged > > > > > before > > > > > > > > being > > > > > > > > > > > returned via client-cursor. > > > > > > > > > > > So, LIMIT should be applied on every node and then on > > merge > > > > > phase. > > > > > > > > > > > > > > > > > > > > > > Also, this may be non-obviuos, limiting results make no > > > sence > > > > > > > without > > > > > > > > > > > sorting, > > > > > > > > > > > as there is no guarantee every next query run will > return > > > same > > > > > data > > > > > > > > > > because > > > > > > > > > > > of page reordeing. > > > > > > > > > > > Basically, merge phase receive results from data nodes > > > > > > > asynchronously > > > > > > > > > and > > > > > > > > > > > messages from different nodes can't be ordered. > > > > > > > > > > > > > > > > > > > > > > 2. > > > > > > > > > > > a. "tokenize" param name (for @QueryTextFiled) looks > more > > > > > verbose, > > > > > > > > > isn't > > > > > > > > > > > it. > > > > > > > > > > > b,c. What about distributed query? How partial results > > from > > > > > nodes > > > > > > > > will > > > > > > > > > be > > > > > > > > > > > merged? > > > > > > > > > > > Does Lucene allows to configure comparator for data > > > sorting? 
> > > > > > > > > > > What comparator Ignite should choose to sort result on > > > merge > > > > > phase? > > > > > > > > > > > > > > > > > > > > > > 3. For now Lucene engine is not configurable at all. > E.g. > > > it is > > > > > > > > > > impossible > > > > > > > > > > > to configure Tokenizer. > > > > > > > > > > > I'd think about possible ways to configure engine at > > first > > > and > > > > > only > > > > > > > > > then > > > > > > > > > > go > > > > > > > > > > > further to discuss\implement complex features, > > > > > > > > > > > that may depends on engine config. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga < > > > > > [hidden email]> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Dear community, > > > > > > > > > > > > > > > > > > > > > > > > By starting this chain I'd like to open discussion > that > > > would > > > > > > > come > > > > > > > > to > > > > > > > > > > > > contribution results in subj. area. > > > > > > > > > > > > > > > > > > > > > > > > Ignite has indexing capabilities, backed up by > > different > > > > > > > > mechanisms, > > > > > > > > > > > > including Lucene. > > > > > > > > > > > > > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (past year release). > > > > > > > > > > > > This is a wide spread and mature technology that > covers > > > text > > > > > > > search > > > > > > > > > > area > > > > > > > > > > > > and beyond (e.g. spacial data indexing). > > > > > > > > > > > > > > > > > > > > > > > > My goal is to *expose more Lucene functionality to > > Ignite > > > > > > > indexing > > > > > > > > > and > > > > > > > > > > > > query mechanisms for text data*. > > > > > > > > > > > > > > > > > > > > > > > > It's quite simple request at current stage. It is > > coming > > > > > from our > > > > > > > > > > > project's > > > > > > > > > > > > needs, but i believe, will be useful for a lot more > > > people. 
> > > > > > > > > > > > Let's walk through and vote or discuss about Jira > > > tickets for > > > > > > > them. > > > > > > > > > > > > > > > > > > > > > > > > 1.[trivial] Use dataQuery.getPageSize() to limit > > search > > > > > > > response > > > > > > > > > > items > > > > > > > > > > > > inside GridLuceneIndex.query(). Currently it is > calling > > > > > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*) - so > > > > > basically > > > > > > > all > > > > > > > > > > > scored > > > > > > > > > > > > matches will me returned, what we do not need in most > > > cases. > > > > > > > > > > > > > > > > > > > > > > > > 2.[simple] Add sorting. Then more capable search > call > > > can be > > > > > > > > > > > > executed: *IndexSearcher.search(query, count, > > > > > > > > > > > > sort) * > > > > > > > > > > > > Implementation steps: > > > > > > > > > > > > a) Introduce boolean *sortField* parameter in > > > > > *@QueryTextFiled * > > > > > > > > > > > > annotation. If > > > > > > > > > > > > *true *the filed will be indexed but not tokenized. > > > Number > > > > > types > > > > > > > > are > > > > > > > > > > > > preferred here. > > > > > > > > > > > > b) Add *sort* collection to *TextQuery* constructor. > It > > > > > should > > > > > > > > define > > > > > > > > > > > > desired sort fields used for querying. > > > > > > > > > > > > c) Implement Lucene sort usage in > > > GridLuceneIndex.query(). > > > > > > > > > > > > > > > > > > > > > > > > 3.[moderate] Build complex queries with *TextQuery*, > > > > > including > > > > > > > > > > > > terms/queries boosting. > > > > > > > > > > > > *This section for voting only, as requires more > > detailed > > > > > work. > > > > > > > > Should > > > > > > > > > > be > > > > > > > > > > > > extended if community is interested in it.* > > > > > > > > > > > > > > > > > > > > > > > > Looking forward to your comments! 
> > > > > > > > > > > > > > > > > > > > > > > > BR, > > > > > > > > > > > > Yuriy Shuliha > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Best regards, > > > > > > > > > > > Andrey V. Mashenkov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > Alexei Scherbakov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Best regards, > > > > > Ivan Pavlukhin > > > > > > > > > > > > > > > > > -- > > > Best regards, > > > Ivan Pavlukhin > > > > > > -- Best regards, Andrey V. Mashenkov |
Yuriy,
Am I getting it right that in your PR, if we have a limit N, then the returned items (at most N) will not necessarily be the most relevant ones? E.g. if one node returned N items faster than the others, but with lower relevance?
-- Best regards, Ivan Pavlukhin |
Ivan,
Yes, your observation is correct. This behavior has been there from the very beginning, ever since Lucene indexing was implemented for distributed queries. The *limit* implementation solves the problem of redundant response size: without it, *ALL* of the records are fetched each time, which is wasteful, especially for loose patterns. To solve the relevance issue, correct sorting still has to be implemented. Y.
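To make the difference between "first N arrived" and "top N by relevance" concrete, here is a minimal Java sketch. It is not Ignite code: the `Hit` type and both method names are hypothetical, invented for illustration. It assumes that each data node has already applied the limit locally and returns its hits sorted by descending score, and that scores coming from different nodes can be compared directly, which may not strictly hold since each node scores against its own index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Illustration only; none of these types or methods exist in Ignite. */
final class TopNMergeSketch {
    /** A scored search hit as one data node might report it (hypothetical type). */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }

        @Override public String toString() {
            return key + ":" + score;
        }
    }

    /** Limit without sorting: keep the first n hits in whatever order the pages arrived. */
    static List<Hit> firstArrived(List<List<Hit>> pagesInArrivalOrder, int n) {
        List<Hit> out = new ArrayList<>();
        for (List<Hit> page : pagesInArrivalOrder) {
            for (Hit h : page) {
                if (out.size() == n)
                    return out;
                out.add(h);
            }
        }
        return out;
    }

    /** Limit with a sorted merge: each per-node list is assumed sorted by descending score. */
    static List<Hit> mergeTopN(List<List<Hit>> perNode, int n) {
        // Each queue entry is {nodeIndex, positionInThatNodeList}; the head has the highest score.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> Float.compare(perNode.get(b[0]).get(b[1]).score,
                                    perNode.get(a[0]).get(a[1]).score));
        for (int i = 0; i < perNode.size(); i++) {
            if (!perNode.get(i).isEmpty())
                heads.add(new int[] {i, 0});
        }
        List<Hit> out = new ArrayList<>();
        while (out.size() < n && !heads.isEmpty()) {
            int[] cur = heads.poll();
            List<Hit> src = perNode.get(cur[0]);
            out.add(src.get(cur[1]));
            // Advance the cursor on the node we just consumed from.
            if (cur[1] + 1 < src.size())
                heads.add(new int[] {cur[0], cur[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        // The "fast" node answers first but holds the less relevant hits.
        List<Hit> fast = Arrays.asList(new Hit("a1", 0.4f), new Hit("a2", 0.3f));
        List<Hit> slow = Arrays.asList(new Hit("b1", 0.9f), new Hit("b2", 0.8f));
        List<List<Hit>> nodes = Arrays.asList(fast, slow);
        System.out.println("first arrived: " + firstArrived(nodes, 2));
        System.out.println("merged top-2:  " + mergeTopN(nodes, 2));
    }
}
```

With a limit of 2, the arrival-order variant keeps only the fast node's hits, while the merged variant picks the two best-scored hits overall. The merge reads at most N hits per node, so the per-node limit still bounds network traffic. Lucene also ships `TopDocs.merge` for combining per-shard hits; whether it fits Ignite's reduce phase is not something this thread settles.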
|
Yuriy,
Thank you, fine with it.
It is > > > > > coming > > > > > > > > from our > > > > > > > > > > > > > > project's > > > > > > > > > > > > > > > needs, but i believe, will be useful for a lot > > more > > > > > > people. > > > > > > > > > > > > > > > Let's walk through and vote or discuss about Jira > > > > > > tickets for > > > > > > > > > > them. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1.[trivial] Use dataQuery.getPageSize() to > > limit > > > > > search > > > > > > > > > > response > > > > > > > > > > > > > items > > > > > > > > > > > > > > > inside GridLuceneIndex.query(). Currently it is > > > > calling > > > > > > > > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*) > > - so > > > > > > > > basically > > > > > > > > > > all > > > > > > > > > > > > > > scored > > > > > > > > > > > > > > > matches will me returned, what we do not need in > > most > > > > > > cases. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2.[simple] Add sorting. Then more capable search > > > > call > > > > > > can be > > > > > > > > > > > > > > > executed: *IndexSearcher.search(query, count, > > > > > > > > > > > > > > > sort) * > > > > > > > > > > > > > > > Implementation steps: > > > > > > > > > > > > > > > a) Introduce boolean *sortField* parameter in > > > > > > > > *@QueryTextFiled * > > > > > > > > > > > > > > > annotation. If > > > > > > > > > > > > > > > *true *the filed will be indexed but not > > tokenized. > > > > > > Number > > > > > > > > types > > > > > > > > > > > are > > > > > > > > > > > > > > > preferred here. > > > > > > > > > > > > > > > b) Add *sort* collection to *TextQuery* > > constructor. > > > > It > > > > > > > > should > > > > > > > > > > > define > > > > > > > > > > > > > > > desired sort fields used for querying. > > > > > > > > > > > > > > > c) Implement Lucene sort usage in > > > > > > GridLuceneIndex.query(). 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3.[moderate] Build complex queries with > > *TextQuery*, > > > > > > > > including > > > > > > > > > > > > > > > terms/queries boosting. > > > > > > > > > > > > > > > *This section for voting only, as requires more > > > > > detailed > > > > > > > > work. > > > > > > > > > > > Should > > > > > > > > > > > > > be > > > > > > > > > > > > > > > extended if community is interested in it.* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looking forward to your comments! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BR, > > > > > > > > > > > > > > > Yuriy Shuliha > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > > > Andrey V. Mashenkov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > Alexei Scherbakov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Best regards, > > > > > > > > Ivan Pavlukhin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Best regards, > > > > > > Ivan Pavlukhin > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Best regards, > > > Andrey V. Mashenkov > > > > > > > > -- > > Best regards, > > Ivan Pavlukhin > > -- Best regards, Ivan Pavlukhin |
In reply to this post by Andrew Mashenkov
Andrew,
I have corrected the PR according to your notes. Please review.
What are the next steps to get it merged?

Y.

On Thu, Oct 3, 2019 at 17:47, Andrey Mashenkov <[hidden email]> wrote:
> Yuri,
>
> I'm done with the review.
> No crime found, just a trivial compatibility bug.
|
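A note on the limit discussion in this thread: each data node can return its Lucene hits already ordered by score, but a limit only gives stable, relevant results if the query-originating node merges those per-node lists by score before cutting at the limit. Below is a minimal, self-contained Java sketch of such a merge. It is an illustration only, not the code from the pull request: `ScoreMerge`, `ScoredHit` and `mergeTopK` are hypothetical names, and no Ignite or Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of a score-ordered merge with a limit; not Ignite or Lucene API. */
final class ScoreMerge {
    /** One search hit as a node might report it: a cache key plus its relevance score. */
    static final class ScoredHit {
        final String key;
        final float score;

        ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }

        @Override public String toString() {
            return key + ":" + score;
        }
    }

    /** Read position inside one node's hit list (assumed sorted by descending score). */
    private static final class Cursor {
        final List<ScoredHit> hits;
        int pos;

        Cursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        ScoredHit head() {
            return hits.get(pos);
        }

        boolean advance() {
            return ++pos < hits.size();
        }
    }

    /**
     * Merges per-node hit lists, each already sorted by descending score,
     * and returns at most {@code limit} hits in global descending score order.
     */
    static List<ScoredHit> mergeTopK(List<List<ScoredHit>> perNode, int limit) {
        // Heap ordered by the current head of each node's list: highest score first,
        // ties broken by key so that repeated runs return the same order.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.<Cursor>comparingDouble(c -> -c.head().score)
                .thenComparing(c -> c.head().key));

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<ScoredHit> merged = new ArrayList<>();

        while (merged.size() < limit && !heap.isEmpty()) {
            Cursor best = heap.poll();   // node whose next hit has the highest score
            merged.add(best.head());

            if (best.advance())          // re-queue the node if it has more hits
                heap.add(best);
        }

        return merged;
    }
}
```

With two nodes returning [a1:0.9, a2:0.4] and [b1:0.7, b2:0.6, b3:0.1] and a limit of 3, the merge yields a1, b1, b2. Each node only needs to send its own top `limit` hits, since no single node can contribute more than `limit` entries to the final result.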
Yuriy,
Just FYI, we have a review checklist [1] and coding guidelines [2].
To test a PR you can use TeamCity [3] or the TeamCity Bot (TCBot) project [4].
The latter makes test validation much easier, so you do not have to bother with flaky tests.
Long story short, you can trigger tests for the PR from the Bot page and then have the Bot attach the results to the Jira ticket if you find them acceptable.

So, the next step is to run the tests and check that everything is OK.

[1] https://cwiki.apache.org/confluence/display/IGNITE/Review+Checklist
[2] https://cwiki.apache.org/confluence/display/IGNITE/Coding+Guidelines
[3] https://ci.ignite.apache.org/
[4] https://mtcga.gridgain.com/
> > > > > > > > > > > > > > > > > > > > Can we consider to open a ticket for this? > > > > > > > > > > > > > > > > > > > > (III) Further extension of Lucene API exposition to > Ignite > > > > > > > > > > > > > > > > > > > > a) Sorting > > > > > > > > > > The solution for this could be: > > > > > > > > > > - Make entities comparable > > > > > > > > > > - Add custom comparator to entity > > > > > > > > > > - Add annotations to mark sorted fields for Lucene > indexing > > > > > > > > > > - Use comparators when merging responses or reducing to > > > desired > > > > > > > limit on > > > > > > > > > > client node. > > > > > > > > > > Will require full result set to be loaded into memory. > > Though > > > > > can be > > > > > > > used > > > > > > > > > > for relatively small limits. > > > > > > > > > > BR, > > > > > > > > > > Yuriy Shuliha > > > > > > > > > > > > > > > > > > > > пт, 30 серп. 2019 о 10:37 Alexei Scherbakov < > > > > > > > > > [hidden email]> > > > > > > > > > > пише: > > > > > > > > > > > > > > > > > > > > > Yuriy, > > > > > > > > > > > > > > > > > > > > > > Note what one of major blockers for text queries is [1] > > > which > > > > > makes > > > > > > > > > > lucene > > > > > > > > > > > indexes unusable with persistence and main reason for > > > > > > > discontinuation. > > > > > > > > > > > Probably it's should be addressed first to make text > > > queries > > > > a > > > > > > > valid > > > > > > > > > > > product feature. > > > > > > > > > > > > > > > > > > > > > > Distributed sorting and advanved querying is indeed > not a > > > > > trivial > > > > > > > task. > > > > > > > > > > > Some kind of merging must be implemented on query > > > originating > > > > > node. > > > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371 > > > > > > > > > > > > > > > > > > > > > > чт, 29 авг. 2019 г. 
в 23:38, Denis Magda < > > > [hidden email] > > > > >: > > > > > > > > > > > > > > > > > > > > > > > Yuriy, > > > > > > > > > > > > > > > > > > > > > > > > If you are ready to take over the full-text search > > > indexes > > > > > then > > > > > > > > > please > > > > > > > > > > go > > > > > > > > > > > > ahead. The primary reason why the community wants to > > > > > discontinue > > > > > > > them > > > > > > > > > > > first > > > > > > > > > > > > (and, probable, resurrect later) are the limitations > > > listed > > > > > by > > > > > > > Andrey > > > > > > > > > > and > > > > > > > > > > > > minimal support from the community end. > > > > > > > > > > > > > > > > > > > > > > > > - > > > > > > > > > > > > Denis > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov < > > > > > > > > > > > > [hidden email]> > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Hi Yuriy, > > > > > > > > > > > > > > > > > > > > > > > > > > Unfortunatelly, there is a plan to discontinue > > > > TextQueries > > > > > in > > > > > > > > > Ignite > > > > > > > > > > > [1]. > > > > > > > > > > > > > Motivation here is text indexes are not persistent, > > not > > > > > > > > > transactional > > > > > > > > > > > and > > > > > > > > > > > > > can't be user together with SQL or inside SQL. > > > > > > > > > > > > > and there is a lack of interest from community > side. > > > > > > > > > > > > > You are weclome to take on these issues and make > > > > > TextQueries > > > > > > > great. > > > > > > > > > > > > > > > > > > > > > > > > > > 1, PageSize can't be used to limit resultset. > > > > > > > > > > > > > Query results return from data node to client-side > > > cursor > > > > > in > > > > > > > > > > > page-by-page > > > > > > > > > > > > > manner and > > > > > > > > > > > > > this parameter is designed control page size. 
It is > > > > > supposed > > > > > > > query > > > > > > > > > > > > executes > > > > > > > > > > > > > lazily on server side and > > > > > > > > > > > > > it is not excepted full resultset be loaded to > memory > > > on > > > > > server > > > > > > > > > side > > > > > > > > > > at > > > > > > > > > > > > > once, but by pages. > > > > > > > > > > > > > Do you mean you found Lucene load entire resultset > > into > > > > > memory > > > > > > > > > before > > > > > > > > > > > > first > > > > > > > > > > > > > page is sent to client? > > > > > > > > > > > > > > > > > > > > > > > > > > I'd think a new parameter should be added to limit > > > > result. > > > > > The > > > > > > > best > > > > > > > > > > > > > solution is to use query language commands for > this, > > > e.g. > > > > > > > > > > > "LIMIT/OFFSET" > > > > > > > > > > > > in > > > > > > > > > > > > > SQL. > > > > > > > > > > > > > > > > > > > > > > > > > > This task doesn't look trivial. Query is > distributed > > > > > operation > > > > > > > and > > > > > > > > > > same > > > > > > > > > > > > > user query will be executed on data nodes > > > > > > > > > > > > > and then results from all nodes should be correcly > > > merged > > > > > > > before > > > > > > > > > > being > > > > > > > > > > > > > returned via client-cursor. > > > > > > > > > > > > > So, LIMIT should be applied on every node and then > on > > > > merge > > > > > > > phase. > > > > > > > > > > > > > > > > > > > > > > > > > > Also, this may be non-obviuos, limiting results > make > > no > > > > > sence > > > > > > > > > without > > > > > > > > > > > > > sorting, > > > > > > > > > > > > > as there is no guarantee every next query run will > > > return > > > > > same > > > > > > > data > > > > > > > > > > > > because > > > > > > > > > > > > > of page reordeing. 
> > > > > > > > > > > > > Basically, merge phase receive results from data > > nodes > > > > > > > > > asynchronously > > > > > > > > > > > and > > > > > > > > > > > > > messages from different nodes can't be ordered. > > > > > > > > > > > > > > > > > > > > > > > > > > 2. > > > > > > > > > > > > > a. "tokenize" param name (for @QueryTextFiled) > looks > > > more > > > > > > > verbose, > > > > > > > > > > > isn't > > > > > > > > > > > > > it. > > > > > > > > > > > > > b,c. What about distributed query? How partial > > results > > > > from > > > > > > > nodes > > > > > > > > > > will > > > > > > > > > > > be > > > > > > > > > > > > > merged? > > > > > > > > > > > > > Does Lucene allows to configure comparator for > data > > > > > sorting? > > > > > > > > > > > > > What comparator Ignite should choose to sort result > > on > > > > > merge > > > > > > > phase? > > > > > > > > > > > > > > > > > > > > > > > > > > 3. For now Lucene engine is not configurable at > all. > > > E.g. > > > > > it is > > > > > > > > > > > > impossible > > > > > > > > > > > > > to configure Tokenizer. > > > > > > > > > > > > > I'd think about possible ways to configure engine > at > > > > first > > > > > and > > > > > > > only > > > > > > > > > > > then > > > > > > > > > > > > go > > > > > > > > > > > > > further to discuss\implement complex features, > > > > > > > > > > > > > that may depends on engine config. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga < > > > > > > > [hidden email]> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Dear community, > > > > > > > > > > > > > > > > > > > > > > > > > > > > By starting this chain I'd like to open > discussion > > > that > > > > > would > > > > > > > > > come > > > > > > > > > > to > > > > > > > > > > > > > > contribution results in subj. area. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Ignite has indexing capabilities, backed up by > > > > different > > > > > > > > > > mechanisms, > > > > > > > > > > > > > > including Lucene. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (past year > > release). > > > > > > > > > > > > > > This is a wide spread and mature technology that > > > covers > > > > > text > > > > > > > > > search > > > > > > > > > > > > area > > > > > > > > > > > > > > and beyond (e.g. spacial data indexing). > > > > > > > > > > > > > > > > > > > > > > > > > > > > My goal is to *expose more Lucene functionality > to > > > > Ignite > > > > > > > > > indexing > > > > > > > > > > > and > > > > > > > > > > > > > > query mechanisms for text data*. > > > > > > > > > > > > > > > > > > > > > > > > > > > > It's quite simple request at current stage. It is > > > > coming > > > > > > > from our > > > > > > > > > > > > > project's > > > > > > > > > > > > > > needs, but i believe, will be useful for a lot > more > > > > > people. > > > > > > > > > > > > > > Let's walk through and vote or discuss about Jira > > > > > tickets for > > > > > > > > > them. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1.[trivial] Use dataQuery.getPageSize() to > limit > > > > search > > > > > > > > > response > > > > > > > > > > > > items > > > > > > > > > > > > > > inside GridLuceneIndex.query(). Currently it is > > > calling > > > > > > > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*) > - > > so > > > > > > > basically > > > > > > > > > all > > > > > > > > > > > > > scored > > > > > > > > > > > > > > matches will me returned, what we do not need in > > most > > > > > cases. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2.[simple] Add sorting. 
Then more capable search > > > call > > > > > can be > > > > > > > > > > > > > > executed: *IndexSearcher.search(query, count, > > > > > > > > > > > > > > sort) * > > > > > > > > > > > > > > Implementation steps: > > > > > > > > > > > > > > a) Introduce boolean *sortField* parameter in > > > > > > > *@QueryTextFiled * > > > > > > > > > > > > > > annotation. If > > > > > > > > > > > > > > *true *the filed will be indexed but not > tokenized. > > > > > Number > > > > > > > types > > > > > > > > > > are > > > > > > > > > > > > > > preferred here. > > > > > > > > > > > > > > b) Add *sort* collection to *TextQuery* > > constructor. > > > It > > > > > > > should > > > > > > > > > > define > > > > > > > > > > > > > > desired sort fields used for querying. > > > > > > > > > > > > > > c) Implement Lucene sort usage in > > > > > GridLuceneIndex.query(). > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3.[moderate] Build complex queries with > > *TextQuery*, > > > > > > > including > > > > > > > > > > > > > > terms/queries boosting. > > > > > > > > > > > > > > *This section for voting only, as requires more > > > > detailed > > > > > > > work. > > > > > > > > > > Should > > > > > > > > > > > > be > > > > > > > > > > > > > > extended if community is interested in it.* > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looking forward to your comments! > > > > > > > > > > > > > > > > > > > > > > > > > > > > BR, > > > > > > > > > > > > > > Yuriy Shuliha > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > > Andrey V. 
Mashenkov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > Alexei Scherbakov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Best regards, > > > > > > > Ivan Pavlukhin > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Best regards, > > > > > Ivan Pavlukhin > > > > > > > > > > > > > > > > > > -- > > Best regards, > > Andrey V. Mashenkov > > > -- Best regards, Andrey V. Mashenkov |
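The score-ordered merge with limit discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ignite's actual reducer: `ScoreMergeSketch`, `Hit`, `Cursor` and `mergeTop` are hypothetical names, and each per-node list is assumed to arrive already sorted by descending score and already cut to the limit on the data node (for example by calling Lucene's `IndexSearcher.search(query, limit)` there instead of passing `Integer.MAX_VALUE`).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merge per-node text query results by score, keeping only the top hits. */
final class ScoreMergeSketch {
    /** Hypothetical holder for one hit: the cache key plus the relevance score reported by Lucene. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Read position inside one node's response (assumed sorted by descending score). */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /** K-way merge: repeatedly take the best remaining head among all nodes until the limit is reached. */
    static List<Hit> mergeTop(List<List<Hit>> perNode, int limit) {
        // Max-heap of cursors, ordered by the score of each cursor's current head.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head().score).reversed());

        for (List<Hit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<Hit> out = new ArrayList<>();

        while (out.size() < limit && !heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head());

            // Re-queue the cursor only if this node still has hits left.
            if (++c.pos < c.hits.size())
                heap.add(c);
        }

        return out;
    }
}
```

With two nodes returning scores (0.9, 0.4) and (0.7, 0.6, 0.1) and a limit of 3, the merge yields the hits scored 0.9, 0.7 and 0.6. Note that this sketch needs all per-node lists before merging, so it trades the current page-by-page streaming for a deterministic order. Comparing raw scores across nodes also assumes that scores computed against different per-node indexes are comparable, which would need to be checked for Lucene.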
In reply to this post by Yuriy Shuliga
Andrey,
Per your request, I created ticket https://issues.apache.org/jira/browse/IGNITE-12291, linked to https://issues.apache.org/jira/projects/IGNITE/issues/IGNITE-12189.

Could you please proceed with the PR merge?

BR,
Yuriy Shuliha

On Wed, Oct 9, 2019 at 12:52, Andrey Mashenkov <[hidden email]> wrote:

> Hi Yuri,
>
> To get access to TC Bot you should register as a TeamCity user [1], if you
> haven't done so already.
> Then you will be able to authorize on the Ignite TC Bot page with the same
> credentials.
>
> [1] https://ci.ignite.apache.org/registerUser.html
>
> On Fri, Oct 4, 2019 at 3:10 PM Yuriy Shuliga <[hidden email]> wrote:
>
> > Andrew,
> >
> > I have corrected the PR according to your notes. Please review.
> > What will be the next steps in order to merge it in?
> >
> > Y.
> >
> > On Thu, Oct 3, 2019 at 17:47, Andrey Mashenkov <[hidden email]> wrote:
> >
> > > Yuri,
> > >
> > > I'm done with the review.
> > > No crime found, only a trivial compatibility bug.
> > >
> > > On Thu, Oct 3, 2019 at 3:54 PM Yuriy Shuliga <[hidden email]> wrote:
> > >
> > > > Denis,
> > > >
> > > > Thank you for your attention to this.
> > > > As for now, the https://issues.apache.org/jira/browse/IGNITE-12189 ticket
> > > > is still pending review.
> > > > Do we have a chance to move it forward somehow?
> > > >
> > > > BR,
> > > > Yuriy Shuliha
> > > >
> > > > On Mon, Sep 30, 2019 at 23:35, Denis Magda <[hidden email]> wrote:
> > > >
> > > > > Yuriy,
> > > > >
> > > > > I've seen you opening a pull-request with the first changes:
> > > > > https://issues.apache.org/jira/browse/IGNITE-12189
> > > > >
> > > > > Alex Scherbakov and Ivan, are you the right guys to do the review?
> > > > >
> > > > > -
> > > > > Denis
> > > > >
> > > > > On Fri, Sep 27, 2019 at 8:48 AM Павлухин Иван <[hidden email]> wrote:
> > > > >
> > > > > > Yuriy,
> > > > > >
> > > > > > Thank you for providing details! Quite interesting.
> > > > > >
> > > > > > Yes, we already have support of distributed limit and merging sorted
> > > > > > subresults for SQL queries. E.g. ReduceIndexSorted and
> > > > > > MergeStreamIterator are used for merging sorted streams.
> > > > > >
> > > > > > Could you please also clarify about score/relevance? Is it provided by
> > > > > > the Lucene engine for each query result? I am thinking about how to do
> > > > > > a sorted merge properly in this case.
>> > > > > > > > > > Should >> > > > > > > > > > > > be >> > > > > > > > > > > > > > extended if community is interested in it.* >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Looking forward to your comments! >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > BR, >> > > > > > > > > > > > > > Yuriy Shuliha >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > -- >> > > > > > > > > > > > > Best regards, >> > > > > > > > > > > > > Andrey V. Mashenkov >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > -- >> > > > > > > > > > > >> > > > > > > > > > > Best regards, >> > > > > > > > > > > Alexei Scherbakov >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > -- >> > > > > > > Best regards, >> > > > > > > Ivan Pavlukhin >> > > > > > > >> > > > > >> > > > > >> > > > > >> > > > > -- >> > > > > Best regards, >> > > > > Ivan Pavlukhin >> > > > > >> > > > >> > > >> > >> > >> > -- >> > Best regards, >> > Andrey V. Mashenkov >> > >> > > > -- > Best regards, > Andrey V. Mashenkov > |