============0_intuition============ Napkin list of loose intuitions; I still don't know why. But in the subsequent chapters I already know what I noticed in the function and why it clicked for me. Refering as appendix for: https://entropment.com/media/downloads/PAPERS_Hierarchical_Metric_Flow_on_Data_Graphs.txt 0. The base object is memory recurrence Starting point: xₙ = xₙ₋₁ − xₙ₋₂ − xₙ₋₃ − ⋯ − x₀ generalizing: xₙ = ∑_{k=0}^{n−1} c_k x_{n−1−k} where: c₀ = 1, c_k = −1 (k ≥ 1) This is a full-memory recurrence. Formally I have the operator: R[x]_n = ∑_{k=0}^{n−1} c_k x_{n−1−k} i.e.: x = R[x] 1. Natural normalization condition: //control of that one at the end? ∑_{k=0}^m c_k = 1 Interpretations? - scale preservation; - no net global growth; - no amplitude drift; Probably the key is whether this can be a level function? Functional version: ∑_{k=0}^m c_k(n) = g(n) where: g(n) = 1 is the base case. Whereas: g(n) ≠ 1 means control of the global drift of the system. This exactly matches the intuition where instead of one there is a function. This is probably a natural control parameter. 2. Characteristic equation (local) If I assume the solution: xₙ = r^n I obtain: r^n = ∑_{k=0}^{n−1} c_k r^{n−1−k} i.e.: r = ∑_{k=0}^{n−1} c_k r^{−k} for a fixed order m: r^m = ∑_{k=0}^{m−1} c_k r^{m−1−k} i.e.: P(r) = 0 where: P(r) = r^m − ∑_{k=0}^{m−1} c_k r^{m−1−k} This is the characteristic polynomial of the recurrence. 3. Remark on signs (probably very important) Signs of the coefficients: c_k ∈ ℝ determine the type of dynamics. (A) All positive c_k ≥ 0 Consequence: - monotonic damping or growth - no oscillations Roots: r ∈ ℝ₊ (B) Alternating signs c_k = (−1)^k a_k Consequence: - oscillations - possibility of stabilization by interference Roots: r ∈ ℂ with nonzero imaginary part. (C) Minuses can be natural because... full memory: xₙ = xₙ₋₁ − ∑_{k=1}^{n−1} x_{n−1−k} is equivalent to: xₙ = xₙ₋₁ − ∑_{j=0}^{n−2} x_j i.e., subtraction of history This is rather a natural mechanism: - damping of memory - compensation of accumulation Not a coincidence, but a structural consequence? 4. Critical behavior (core of the dynamics) Consider the general form: xₙ = ∑_{k=0}^{n−1} a_k x_k This gives the polynomial: r^n = ∑_{k=0}^{n−1} a_k r^k i.e.: P(r) = r^n − ∑_{k=0}^{n−1} a_k r^k Criticality depends on the roots: r_i Stability: |r_i| < 1 the sequence decays. Criticality: |r_i| = 1 system on the boundary of stability. The most interesting case is explosion: |r_i| > 1 the sequence diverges. This looks like a diagnostic tool. For studying the course? Memory? I need to think about it. 5. Full memory as a convolution operator Recurrence: xₙ = ∑_{k=0}^{n−1} c_k x_{n−1−k} is discrete convolution x = c ∗ x where: c = (c₀, c₁, c₂, …) This gives the Z-transform: X(z) = C(z) X(z) i.e.: 1 − C(z) = 0 This is again the characteristic equation. 6. “Tilted” version (inhomogeneous) //critical line; Here something interesting begins. Form: xₙ = ∑_{k=0}^{n−1} a_k(n) x_{n−k} i.e., coefficients depend on the level. Formal interpretation of the operator: R_n[x] = ∑_{k=0}^{n−1} a_k(n) x_{n−k} i.e.: xₙ = R_n[x] This is a level-dependent recurrence. 7. Critical control function Condition: ∑_{k=0}^m a_k(n) = g(n) where: g(n) can be: - constant - function - curve This is exactly the “tilted version”. Relatively with an interesting function the “twisted version”. 8. Minimal operator apparatus. Putting it together. I have: (A) Memory operator R[x]_n = ∑_{k=0}^{n−1} c_k x_{n−1−k} (B) Normalization condition ∑_k c_k = g (C) Characteristic polynomial P(r) = r^m − ∑_{k=0}^{m−1} c_k r^{m−1−k} (D) Critical criterion max_i |r_i| = 1 (E) Tilted version xₙ = ∑_{k=0}^{n−1} a_k(n) x_{n−k} 9. Minimal starting examples. So that it is not abstract. Example 1 is full memory xₙ = xₙ₋₁ − xₙ₋₂ − xₙ₋₃ − ⋯ − x₀ i.e.: c₀ = 1, c_k = −1 Example 2 is alternating kernel c_k = (−1)^k i.e.: xₙ = xₙ₋₁ − xₙ₋₂ + xₙ₋₃ − … Oscillatory case. Example 3 is tilted a_k(n) = (−1)^k e^{−α k / n} i.e., memory decay dependent on the level. 10. The most “napkin” version of the intuition for further digging Minimal set: xₙ = ∑_{k=0}^{n−1} c_k(n) x_{n−1−k} with: ∑_{k=0}^{n−1} c_k(n) = g(n) and: r^n = ∑_{k=0}^{n−1} c_k(n) r^{n−1−k} and criticality: max |r_i| = 1 This is sufficient as a starting exploratory apparatus. 11. Something clicked intuitively here, probably because: //intuition, the whole construction simply came to my mind and I am writing it down; Form: xₙ = xₙ₋₁ − ∑_{j=0}^{n−2} x_j is almost equivalent to: xₙ = xₙ₋₁ − S_{n−2} where: S_n = ∑_{k=0}^n x_k i.e., coupling with the sum of history. It looks like a structurally strong form; probably the most interesting things happen right there. ============1_intuition_in_aplication============ It resembles Volterra difference equations. In the discrete version Volterra difference equation (with memory). The spectral radius of the operator is the standard stability test. Autoregressive model with long memory in particular AR(∞) [that is probably why it came to my mind and I liked it], long-memory processes, fractional models. x = c ∗ x is a linear time-invariant system (LTI); Z-transform; Population and aggregation models; coupling with the sum of history; aggregation dynamics with memory; So I can do asymptotic analysis: Especially when I introduce: Sₙ = ∑_{k=0}^n x_k i.e., the sum of history. This gives: xₙ = x_{n-1} - S_{n-2} and then I analyze: - whether Sₙ stabilizes - grows linearly - grows exponentially - oscillates The normalization condition is very sensible: ∑_k c_k(n) = g(n) i.e., mass / energy balance, which can be interpreted as: g(n) = 1 → scale preservation g(n) < 1 → damping g(n) > 1 → growth g(n) variable → system drift So a mechanism of global dynamics control. Curiosities... a_k(n) and level-dependent kernel. Nonlocal adaptive dynamics: - variable memory - variable operator - dynamic geometry of history Applications: (1) Asymptotic approximation Studying: xₙ ∼ rⁿ i.e., of the dominant root. This gives: - growth rate - decay rate - oscillativeness (2) Approximation by aggregation sequence I define: Sₙ = ∑_{k=0}^n x_k and analyze: Sₙ / n This gives the long-term average (3) Functional limits If: c_k(n) → c_k then often: xₙ approaches the solution of the constant operator. Memory stabilization scheme Application curiosities: (A) Coupling with the sum of history xₙ = xₙ₋₁ − Sₙ₋₂ Because: Sₙ becomes a higher-order state variable. Which gives: -natural damping -compensation of accumulation (B) Criticality condition max |rᵢ| = 1 is the classical phase boundary. This is where there occur: -oscillations -quasi-stability -slow decays (C) Level-dependent kernel aₖ(n) This is a model of memory with variable geometry. //And this is already useful for me; Good for studying aggregation sequences that I already have: -stabilization of sums -compensation of history -net equilibrium Scheme: 1. I define xₙ 2. I compute Sₙ 3. I study: lim_{n→∞} Sₙ or: lim_{n→∞} Sₙ / n Classical aggregation analysis. So studying: -long-range memory models -filters with adaptive memory -systems compensating accumulation -processes stabilizing history sums Very good for: -resource stabilization models -adaptive filters -dynamics with memory balancing ============2_practical_aplications_for_HFoDG============ 1. It looks like useful “interference”; //chapter further I move to the alphabet on this calculus If I have: xₙ = ∑_{k=0}^{n-1} cₖ x_{n-1-k} then: - each step is a superposition of history - the signs and values of cₖ cause constructive or destructive summation If the coefficients have alternating signs: cₖ = (−1)^k aₖ then: - some terms reinforce - others damp Effect: - oscillations - quasi-stabilization - damping despite large local values So something similar to: - difference filters - systems compensating accumulation 2. Detecting loops and oscillations is a very natural use; If I have a lot of aggregated histories, I can study: (A) Whether it falls into a loop Formally whether there exists p such that: x_{n+p} = xₙ In this model it means: - characteristic roots on the unit circle - arguments being multiples of 2π/p Practically: - I detect cyclicity - even if the amplitude changes (B) Whether it oscillates which corresponds to: complex roots: r = ρ e^{iθ} Then: xₙ ∼ ρⁿ cos(nθ) Trivial interpretation: - ρ < 1 → oscillations decay - ρ = 1 → persistent oscillations - ρ > 1 → exploding oscillations 3. Detecting explosion (very important in aggregation, to know when to stop, predefined stop signal) If I have a large number of data and aggregate them recursively, it is very easy to get: - uncontrolled growth - amplitude avalanche Test: max |rᵢ| is exactly what is used for: - stability diagnostics - explosion detection If: max |rᵢ| > 1 then I have a structural explosion, not a random one. So if it does not decay then it is worth damping the process at the resource limit checking whether it stabilizes further. 4. Exploration of history is very important, this is rather a strong point; I have something that can be treated as: Sₙ = ∑_{k=0}^n xₖ i.e., accumulation of history If I couple: xₙ = x_{n-1} − S_{n-2} then interesting things happen: - history starts to compensate itself - feedback couplings arise This gives the possibility of exploring memory stability Not only local, but global in time. I.e., recording the trajectory of the calculation course. 5. What I can practically study; If I have a lot of aggregated data, I can compute: (1) stability whether: |xₙ| remains bounded. This is the first test. (2) oscillativeness whether the sign of: xₙ changes cyclically. (3) periodicity searching for: x_{n+p} − xₙ → 0 for some p. (4) explosion whether: |xₙ| → ∞ (5) aggregate stabilization whether: Sₙ / n approaches a constant. This is very important in aggregating large quantities. 6. What is particularly interesting with a large number of aggregated objects Because I already have what I would like to aggregate in large quantity, i.e., processes of joining tokens in hierarchical steps into abstracts; This equation becomes an exploratory tool, not only generative. I can use it to: - filter histories - reinforce patterns - damp noise - detect instabilities 7. In practice it can work like an interference filter I have a kernel: cₖ Which acts as masking of history It reinforces some patterns and damps others This is exactly the mechanism of structure detection in time. 8. This is good for exploration of the parameter space I can change: - signs of cₖ - memory length - decay rate and study what happens with the course. This gives a map of system behaviors. 9. If the data are “large” it gets more interesting If I aggregate many sequences: xₙ⁽ⁱ⁾ then I can study: ∑ᵢ xₙ⁽ⁱ⁾ i.e., interference between histories. This can be a strong exploratory tool. 10. Then this construction is suitable for: - exploration of aggregation dynamics - detection of cycles - detection of resonances - study of memory stability - study of history asymptotics An interesting diagnostic tool for memory systems. 11. Technical warning Full memory: xₙ = ∑_{k=0}^{n-1} cₖ x_{n-1-k} has cost: O(n) per step. So: - with large histories - without memory decay it becomes expensive. But I can aggregate with loss of resolution on the Finsler like HMFoDG up to the ultrametric boundary. ============3_flow_analise============ 1. What am I actually doing here //I am making the alphabet I have: - many inference runs - each generates a history of states and the states are: - vectors - embeddings - graphs (e.g. centroids, relations) Formally: xₙ ∈ ℝᵈ or: xₙ ∈ V(G) i.e., recurrence on structural objects, not scalars. Then the equation: xₙ = ∑_{k=0}^{n-1} cₖ x_{n-1-k} becomes: 𝐱ₙ = ∑_{k=0}^{n-1} cₖ 𝐱_{n-1-k} //interpretation/source of x changes; i.e., linear interference of embedding histories. Even sensible, and it looks the same. 2. I analyze not the result, but the trajectory; This way I can link together the results of inference and examine them I do not analyze only the final result, but the entire course over time I.e., trajectory signature. 3. “Time skew” is not a problem, but a signal //different structures of operation allocation to threads, mess in communication through the kernel, delays between CPU cores or network connection of a larger number of execution units; Recalculating the same thing with a different division between CPU cores in a different order creates: - different trajectories - but from the same logic I.e., different realizations of the same structure. This is interesting material for: - analysis of process stability - detection of degeneration - detection of bifurcations 4. The equation as a “trajectory aggregator” If I have the course: x₀, x₁, x₂, … I can build: yₙ = ∑_{k=0}^{n-1} cₖ x_{n-1-k} This gives an aggregated representation of history. Interpretation would be useful: - repeatable patterns → reinforce themselves - random fluctuations → damp So a mechanism of interference summary of the process. 5. Possible process similarity test: “Was the processing process “similar” to an already known one?” Scheme: Data: - I have a known reference trajectory: rₙ New course: xₙ I build: Yₙ⁽ˣ⁾, Yₙ⁽ʳ⁾ i.e., history aggregates. Then I compute: ⟨Yₙ⁽ˣ⁾, Yₙ⁽ʳ⁾⟩ or: ‖Yₙ⁽ˣ⁾ − Yₙ⁽ʳ⁾‖ If: - small → processes similar - large → processes structurally different Not only the result, but the process. 6. I can detect “whether it is the same type of process” //similarity Not only whether the result is similar, but whether the path to the result was similar. This is supposedly a huge difference. //not for me, because I see no difference between heat generation calculus in FPU, black hole radiation and the cost of using an eraser in Maxwell’s Demon notebook, for me it is all Landauer in different mutations; I can detect: - instabilities - anomalies - computational drift - sensitivity to order 7. This is very interesting with graphs that I operate on; Because: xₙ can be: - graph embedding - centroid - flow metric The recurrence acts like a memory filter on the graph structure. Which allows studying: - whether the graph stabilizes - whether it starts to oscillate - whether it changes topology Graph of inference/process/operations. Not what is happening there but how it is happening. 8. What resembles analysis of trajectory resonances If I use: cₖ = (−1)ᵏ or: cₖ = e^{-α k} then different time patterns have different responses. I.e., different trajectories “resonate” differently. Which allows: - classifying courses - detecting similarities 9. A very important point is nonlinearity through CPU scheduling Because I can force them to be nonlinear, and exactly as I want at the start, and later Lyapunov will do its job; I.e., in practice controlled perturbation. By changing the order: - I change local dynamics - but not global logic This allows studying semantic stability of the process Whether small perturbations → large changes? If so then I have an unstable pipeline. And it can be detected. 10. So the application is process fingerprint I can build: F = ∑_{n=0}^N wₙ xₙ where: wₙ = cₙ This gives a trajectory fingerprint. Then I compare fingerprints. 11. And even better is detection of loops and oscillations in the pipeline If the pipeline falls into: - repeatable patterns - deadlock-like behaviors - topological cycles then I will see these oscillations in the aggregate. And I have a stop point. This works even if individual states look similar 12. What fits the HMFoDG context (embeddings + graphs) - I have hierarchical structures - I have metric flows - I have many trajectories Operator: R[x]ₙ = ∑ cₖ x_{n-1-k} is a memory operator on the metric trajectory. 13. I will sketch the procedure so there is no aggressive jump in reasoning; step 1 is recording the trajectory for the course: x₀, x₁, …, xₙ step 2 is interference aggregation yₙ = ∑_{k=0}^n cₖ x_{n-k} step 3 is final fingerprint F = yₙ step 4 is comparison of fingerprints between courses. Which gives comparison of inference courses. 14. I will find out... Not only whether the process was similar, but whether it was stable with respect to order perturbations. Because I can run the same in a different order of linking (so the one path problem of LLM inference and neurotypical people requiring narration and heuristics). And on a linear, deterministic system I will check whether it works the same in every direction, and identify exceptions. Strong in systems: - multi-threaded - graph-based - order-dependent I.e., in what I have in HMFoDG 15. I will summarize the thought, because apparently I do it too rarely; Trajectory similarity under scheduling perturbations so whether different executions of the same logic lead to: - the same trajectory or - structurally different trajectories. Practical; ============4_flow_fingerprint============ 1. Input I have flows on graphs (series of operations in various mixed metrics); //I repeat myself because I need to think it through and put it together; From HMFoDG: - I have hierarchical graphs - with metric flow - on many levels simultaneously - with level weighting I.e., from the literature, so as not to get lost: - flow hierarchy - metric flow - information flow on graphs So objects that evolve in time as flows over the graph. In such models, analysis of trajectories of metric or graph structure changes, not only static states, makes sense. The recurrence is a filter of the history of such a flow. Not a generator, but a filter. 2. Formally I have a trajectory: x₀, x₁, x₂, … where: xₙ = { embedding graph feature vector } i.e., a multi-level state. Then I do: yₙ = ∑_{k=0}^{n-1} c_k x_{n-1-k} This is a memory projection of the trajectory: - which is not a sum of data - it is a projection of history onto a memory filter 3. I can study “process similarity” As similarity of the flow trajectory. 4. And since I expect time-skew I obtain: - different trajectories - of the same logic I.e., different realizations of the same flow. If the pipeline is stable the trajectories should rather: - interfere constructively - give a similar aggregate If not then there appears: - drift //i.e., these were different inference flows; - oscillations //tautologies? - bifurcations //very different 5. So I have a process fingerprint Not a data fingerprint, but a course fingerprint. Scheme for the trajectory: x₀, …, x_N I compute: F = ∑_{k=0}^N c_k x_{N-k} This is the history signature. Then I compare: ‖F₁ − F₂‖ between executions. 6. I detect “whether the pipeline behaves topologically the same” Not only numerically, but topologically. For example whether: - the graph structure stabilizes - changes class - enters cycles Sensible in the analysis of hierarchical graph flows; I study the structure of information flow between layers. 7. The most practical use is the pipeline stability test. Procedure: 1 - I run the process several times (with different scheduling) 2 - I compute history aggregates 3 - I compare them If: ‖F_i − F_j‖ small → stable large → sensitive to order 8. Three-level processing I process all three levels at once and weight them //euclid distance, hyperbolic cosine, finsler p-addi until ultrametric regime; Then: xₙ = w₁ vₙ + w₂ eₙ + w₃ gₙ where: - vₙ is the raw vector; - eₙ is the direction embedding (normalized vector); - gₙ is the graph (p-adic interpretation for a given lambda); This will give me a multi-channel trajectory and a multi-level filter. Because something may explode in Euclid and oscillate (get stuck) in ultrametric. Then the filter will be somewhere on the lambda slider. I.e., the range of reasonable assumptions, their numbers that are worth taking into account, to obtain at least a sketch of inference and only then consider them erroneous or in some way sensible. 9. With this construction I can examine... (A) whether the pipeline falls into loops symptom: x_{n+p} ≈ x_n After aggregation there will be oscillations. (B) whether drift appears: ‖xₙ‖ grows systematically. (C) whether there are bifurcations when a small change in order: → large change in trajectory Which is important with CPU scheduling. And may require quite interesting hierarchical throttles. Competition of cycles. (D) whether the process is “topologically similar”, so whether the graphs end up in: - the same class - similar structure 10. Objects evolve through flows in metric space. Different metrics. This naturally leads to: - trajectories - flows - stabilization of structures This will be useful for modeling the evolution of data geometry and interpolation of trajectories in metric space. 11. I can force nonlinearities through scheduling Which gives controlled perturbations I.e., I can study structural stability of the process. So not a mathematical test, but an engineering pipeline test. 12. I need to make a library of memory filters i.e., various: c_k for example... damping filter c_k = e^{-α k} for stability testing. oscillatory filter c_k = (-1)^k e^{-α k} for cycle detection. threshold filter where memory only up to: k ≤ m for locality testing. This will give a family of trajectory detectors. This might work. ============5_L_over_trajectory_calc============ Model of language over computation trajectories and memory filter as an analyzer of stability of these trajectories. It is worth writing this intuition as technically as possible. 1. The key idea is the pipeline as a “word” //well some series of markers, something like a vector with numerical positions, as in p-adics; I treat it as a "word"; long, but it is some symbolic string of operations //I adopt this intuition; I have: W = o₁ o₂ o₃ … oₙ //this will soon change to a tree, for now I am thinking out loud; where: oᵢ ∈ Σ i.e.: - alphabet = types of operations - word = pipeline course So I am in the field of Formal Language Theory. Safely. And Sequence Modeling. Only this is not text, but a computation trajectory. 2. n-grams as detectors of heuristics... //preliminary plan, intuition; n-grams will probably be recognizable on the fly; I have: (oᵢ, o_{i+1}, …, o_{i+n-1}) i.e., local operational pattern Interpretatively this is something like: - heuristic = repeatable fragment of the word - strategy = sequence of n-grams This leads to detection of process schemes (let's assume inference or analysis) not only results. 3. Loops as repeatable n-grams where I will be able to extract loops Formally a loop is: (u)ᵏ i.e., repetition of the word u. This gives: - heuristic cycles - dead-ends - optimal paths I.e., somewhere in Automata Theory, especially if it serves for detecting cycles in automata. And after all it runs on a computer. Condition satisfied. 4. “Intuition” as trajectory memory //computational property along the trajectory as "intuition" for the inference process, not mine; Initialized "intuition" (heuristic? something like that) is in practice a system for predicting trajectory based on history. //so I can call it this in functions and variables so as not to get lost in the code; Scheme, if: (o_{n-k}, …, oₙ) matches a known pattern then: - it predicts further course - or rejects the branch //depending where it led, or rejects to the stack for examination; Such predictive pruning. 5. Metathinking (literally, assuming this is the inference process) Formally this is a meta-model over inference trajectories. Not a data model, but a process model. And I enter the area of Meta-Learning and Program Analysis. 6. The earlier recurrence fits here Because I have two levels: level 1 - language of operations W = o₁ o₂ … oₙ i.e., word. level 2 - memory aggregation, where: yₙ = ∑ₖ cₖ x_{n-k} can act on: - n-gram embeddings - or pattern counters I.e., interference of heuristic patterns. Something like an integral over different trajectories of the process executed in different order allows determining how to feed it to the machine for execution, because it is known. And we know when we are trying something innovative, and when we are doing something that did not work before (explosion or dead end). 7. Detection of “processes leading nowhere”; It will allow prior rejection of certain processes which is very realistic. If a certain pattern: uᵏ historically led to: - explosion - lack of solution then I can mark it as toxic and: - early terminate the process - or change strategy Classic early termination heuristic. However it is worth marking in the answer that a heuristic was used, and in the process we know about it. 8. Explosion detection fits very well If I have a pattern, let’s say: A B C D E F G H I J … which historically → generated more and more operations then I can detect patterns of complexity growth. Even before the explosion happens. And think about the decision. 9. Curiosity “comparing courses”; I have two words: W₁, W₂ I compare: - n-grams - their order - lengths Which gives a metric of process similarity. 10. Partial matching because I do not have to have identical words. I can have: - similar fragments - different endings //prefix/infix/suffix Which allows: - detecting alternative solutions - exploring the tree of possibilities And I can add more solutions, branches to a given prefix, i.e., examining on the fly I will know which direction it is going. Because certain solutions may lead nowhere. 11. The OS messes with operations and their order already at the kernel level and async data flow; So it introduces order noise, but if the heuristic is stable then n-grams will repeat despite this. A good diagnostic property. 12. It resembles building a “language of thinking” I have: - alphabet of operations - grammar of heuristics - history of trajectories So a formal language of inference. 13. The benefit is rejecting dead paths, computational gain. Because instead of exploring everything I do pruning based on history. 14. If the words are very long then n-grams explode numerically. One solution is to limit: - maximum n length - or use variable-length n-grams But I also have another way :) 15. If I combine: - operation n-grams with - state embeddings then I get semantic n-grams. Not only what was done but what it meant in the process. 16. Over time I will start to have a dictionary of heuristics. Not hand-written. Learned from history. ============6_autofacepalm============ 1. So what do I have assuming that I dump the process to a file, for later, and not analyze it on the fly; -asynchronous inference results -I collect them into a sequence of operations -I treat the sequence as: S = (o₁, o₂, o₃, ..., o_T) i.e., a word over the alphabet of operations; Then: - I generate n-grams - I detect repeatable patterns - I mark them as: - heuristics - loops - erroneous paths - exploding paths Normal analysis of the operational language of the process. 2. What it really allows to study: loops if I have: (..., A, B, C, A, B, C, A, B, C, ...) then n-grams will detect: (A, B, C) (A, B, C) (A, B, C) as an operational cycle, therefore: - cycle detection in graphs - loop detection in dynamic systems oscillations, if I have: A, B, A, B, A, B, A, B, A, B then: - bigrams → (A, B) - trigrams → (A, B, A) will detect the oscillatory character of the process Just like in: - analysis of discrete oscillators - analysis of iterative dynamics explosion (divergence) e.g.: A, A, A, A, A, A, ... or: - increasing number of unique operations - increasing entropy of n-grams indicates: - instability - divergence of the process Which can be formalized as H(n-grams) ↑ i.e., pattern entropy grows. 3. Process similarity Trajectory similarity i.e., comparison of operation trajectories. I have two processes: S₁, S₂ I compare: - sets of n-grams - histograms of n-grams - distance between them e.g.: d(S₁, S₂) One can use: - Jaccard similarity - cosine similarity - edit distance 4. And from the file I can obtain a “residual facepalm”. Residual process checking previous reasonings: - execution trace mining - dynamic analysis - invariant discovery the system after some time will return: - this fragment was redundant - this fragment leads to dead-ends In RL there is something very similar: experience replay, where the agent: - performs actions - later analyzes the history - learns heuristics I.e., close to meta-heuristic mining, LLM reasoning trace analysis. 5. The model can recognize that this has already happened; Meta-trajectory recognition i.e., the process learns structures of its own inference. Not only results. 6. Model = language over operations I treat trajectories as words This means that the system becomes L = language over operations. And then I can apply: - formal language theory - Markov models - automata - grammars I.e., someone has already done all the work and one can use it without reinventing the wheel. 7. Typical tools: - α-algorithm - Heuristics Miner Execution Trace Analysis used in: - debugging - compiler optimization //because in principle I am making a precompiler of prompts and division of complex work between agents; Symbolic dynamics where a string of symbols: S = (s₁, s₂, s₃, ...) describes the dynamics of the system. Compression-based similarity e.g.: - LZ complexity - Kolmogorov complexity estimates comparing sequences through compression. 8. Which allows: automatically detecting heuristics e.g. if: (A, B, C) → success then the system can mark it as an effective heuristic. rejecting paths e.g.: (D, E, F) → failure marking it as a bad pattern detecting infinite loops e.g. repeating n-grams. analyzing stability e.g. whether the process: - stabilizes - oscillates - diverges 9. It will work well for -embeddings -graphs -iterative inferences i.e., exactly what I am doing; //I am only getting used to this idea because there was a problem why one inference trajectory is good and another is bad; Operations are: -discrete -sequential -aggregative 10. The biggest risk is n-gram explosion, but this can be controlled by using: limited n e.g.: n ≤ 6 //or some selected number in the range; hashing n-grams like rolling hash pruning rare patterns i.e., ignore n-grams with low frequency. 11. An initialized intuition may arise as online pattern recognition; the system: -sees a fragment -matches it to a known one -predicts what comes next, or asks; So it collects experience without collecting context; ============7_hierarchical_alphabet============ The alphabet is hierarchical, not flat. 1. I assume working levels: Level 0 - token level, for example: • token → role (operation / argument / condition) i.e.: ti → ri Semantic parsing. //I have it, check; Level 1 - local aggregation There arise: - pools of resolutions - local aggregation results i.e.: A₁ = f(t₁, t₂, ...) These are already local hypotheses. //check, in progress; Level 2 - selection and filtering Here the selection of the most sensible solutions takes place i.e.: A₂ = g(A₁) This is hypothesis ranking. //well this is what I needed; Level 3 - iterative inference from models i.e.: - queries to models - aggregation of responses A₃ = h(A₂) therefore solver pipeline Level 4 - meta-analysis of history i.e.: - comparison with previous courses - strengthening of heuristics A₄ = m(A₃, history) Gives meta-learning over courses 2. The alphabet is not linear — it is tree-like T = tree of operations and not: S = (o₁, o₂, o₃) I.e., I have tree-grams, not only n-grams. Which will be useful in self-similar loops. 3. It resembles Hierarchical Process Mining where: - operations have levels - process structures are detected - strategies are reconstructed Program trace mining where one analyzes: - function calls - execution paths But here I will do trace reasoning, not trace execution. There is a difference. 4. The goal is to heuristically save on tokens by recognizing reasoning patterns and giving hints I.e., strategic prompting based on history. E.g.: instead of “solve the problem” I give “compare with the previous case X, exclude variant Y, check differences Z”. Which enormously reduces: - search space - computational cost 5. “Precompiler of prompts”; Solver for the prompt precompiler The goal is prompt optimizer pipeline, but with memory of courses. S 6. Hierarchy of decisions allows detecting: - not only operation patterns - but strategy patterns I.e., not how to do something well, but when something goes wrong and has passed through individual sieves. Such large loops, when every step in checklists was consistent, but everything drifts and the result will be detached. 7. The most natural formalization is probably grammars over courses G = (N, Σ, P, S) i.e., grammars where: - Σ — atomic operations - N — aggregation operations - P — combination rules - S — final strategy Which allows detecting: - repeatable reasoning schemes - decision structures 8. Such Hierarchical Reinforcement Learning where there are: - macro-actions - strategies - meta-strategies Here I make an analog for reasoning, not moves. 9. I can detect when it is NOT worth asking questions; Because the most expensive operation is a query to the model. If I can: - recognize a known pattern - predict the result then I save this cost. 10. The hierarchical alphabet must be controlled; Because otherwise it explodes. The most common error in such systems is too detailed symbols It is better to make: -symbol ≠ specific operation -symbol = class of operation I.e., again p-adic and boundary classes; 11. What I can detect in such a system (practically): - effective heuristics e.g. pattern: classify → refine → compare → accept → high probability of success. - erroneous heuristics e.g. expand → expand → expand → explode → divergence. - dead paths e.g. try → contradict → retry → contradict → dead-end. - redundancies e.g. the same decision multiple times. The best may be remembering only: - repeatable patterns - critical patterns - erroneous patterns And not everything. So there must be some heap for collecting, in order to discover new repetitions. Examined from time to time, when enough has accumulated.