Long stop of node due to diagnostics of CorruptedTreeException

Kirill Tkalenko
Hi everyone!

I found that when a CorruptedTreeException occurs, the node can take a long time to stop (more than an hour), because DiagnosticProcessor#dumpPageHistory may scan many WAL segments. That does not seem acceptable.

I propose moving these diagnostics to IgniteWalConverter by adding a "pages" argument that accepts either a comma-separated list of pages in the format "grpId:pageId,grpId:pageId,grpId:pageId" or a file in which each line contains one entry in the format "grpId:pageId".
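To make the two input forms concrete, here is a minimal sketch of how such a "pages" value could be parsed. It is only an illustration, not existing Ignite code: the class PagesArgParser and its PageRef holder are hypothetical names, and it assumes grpId is a decimal int and pageId is a decimal long.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for the proposed "pages" argument; not part of Ignite. */
class PagesArgParser {
    /** One "grpId:pageId" entry (assumed types: int group id, long page id). */
    static final class PageRef {
        final int grpId;
        final long pageId;

        PageRef(int grpId, long pageId) {
            this.grpId = grpId;
            this.pageId = pageId;
        }
    }

    /** Parses the inline form: "grpId:pageId,grpId:pageId,...". */
    static List<PageRef> parseList(String arg) {
        return parseEntries(Arrays.asList(arg.split(",")));
    }

    /** Parses the file form: one "grpId:pageId" entry per line. */
    static List<PageRef> parseFile(Path file) throws IOException {
        return parseEntries(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    /** Shared entry parsing; blank entries are skipped, malformed ones are rejected. */
    private static List<PageRef> parseEntries(List<String> entries) {
        List<PageRef> res = new ArrayList<>();

        for (String raw : entries) {
            String entry = raw.trim();

            if (entry.isEmpty())
                continue;

            int sep = entry.indexOf(':');

            if (sep <= 0 || sep == entry.length() - 1)
                throw new IllegalArgumentException("Expected grpId:pageId, got: " + entry);

            // Non-numeric parts fail here with NumberFormatException.
            int grpId = Integer.parseInt(entry.substring(0, sep));
            long pageId = Long.parseLong(entry.substring(sep + 1));

            res.add(new PageRef(grpId, pageId));
        }

        return res;
    }
}
```

Both forms go through the same entry parser, so a file produced by one tool and a list typed by hand would be validated identically.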

I also propose saving the corrupted pages to a file (DiagnosticProcessor#diagnosticPath + /corruptedPages_ts.txt) in the class, and logging a message telling the user that they can run IgniteWalConverter with "--pages=DiagnosticProcessor#diagnosticPath + /corruptedPages_ts.txt".
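As a rough sketch of the saving step, the file could be written and the hint built as follows. The class CorruptedPagesDump and its methods are hypothetical, the "ts" suffix is assumed to be a timestamp supplied by the caller, and the "--pages" flag is the argument proposed above, not an existing option.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of saving corrupted pages; not existing Ignite code. */
class CorruptedPagesDump {
    /**
     * Writes one "grpId:pageId" line per page into
     * diagnosticDir/corruptedPages_<ts>.txt and returns the file path.
     */
    static Path dump(Path diagnosticDir, long ts, int[] grpIds, long[] pageIds) throws IOException {
        if (grpIds.length != pageIds.length)
            throw new IllegalArgumentException("grpIds and pageIds must have the same length");

        List<String> lines = new ArrayList<>();

        for (int i = 0; i < grpIds.length; i++)
            lines.add(grpIds[i] + ":" + pageIds[i]);

        // Create the diagnostic directory if it does not exist yet.
        Files.createDirectories(diagnosticDir);

        Path file = diagnosticDir.resolve("corruptedPages_" + ts + ".txt");

        Files.write(file, lines, StandardCharsets.UTF_8);

        return file;
    }

    /** Builds the message to log; "--pages" is the proposed argument. */
    static String hint(Path file) {
        return "Corrupted pages were saved to " + file.toAbsolutePath()
            + ". To dump their history, run IgniteWalConverter with --pages=" + file.toAbsolutePath();
    }
}
```

Writing the file is cheap compared with scanning WAL segments, so the node could stop quickly and the expensive analysis would run offline.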