Author: Jack Kowalski https://entropment.com PATENT PENDING

Appendix: implementation 1. Notes from the implementation of, and reasoning behind, Hierarchical_Metric_Flow_on_Data_Graphs. I have collected here notes on thoughts and conclusions related to the base document, mainly concerning interesting facts, problems, and other side computational applications.

========== 01==========

Formalism of mapping a direction to an ultrametric address, its limits, and the graph variant with n(n-1)/2 interactions. It leads to a natural quantization through finite references, even for an infinite graph.

Cunning plan:
1. Definition of the direction mapping → p-adic address
2. Limit of infinite resolution
3. Reconstruction (approximate inversion)
4. Version without normalization (direction + radius)
5. Formal limitations
6. Graph version n(n-1)/2
7. Forced quantization through finite references
8. Examples of applications

1. Formal definition of the direction mapping
Let v ∈ ℝⁿ, v ≠ 0. I normalize: v̂ = v / ‖v‖ ∈ S^{n-1}, i.e. a point on the unit sphere. A trivial observation in hypermetrics.

Resolution level function
Let ϕₖ: S^{n-1} → {0,1,…,p-1} denote the choice of digit at level k. Each ϕₖ(v̂) selects a region of the sphere.

Ultrametric address
We define: U(v̂) = ∑_{k=1}^∞ ϕₖ(v̂) p^{k-1}, i.e. U: S^{n-1} → ℤₚ.

Interpretation
A prefix of length k, Uₖ = (d₁, d₂, …, dₖ), defines a region Ωₖ ⊂ S^{n-1} such that Ω_{k+1} ⊂ Ωₖ, i.e. a sequence of nested regions.

2. Limit of infinite resolution
The key object: ⋂_{k=1}^∞ Ωₖ. If diam(Ωₖ) → 0, then ⋂ₖ Ωₖ = {v̂}, i.e. the address uniquely determines the direction. This is "direction by one number" in the limiting sense.

3. Ultrametric //in accordance with the initial aggregation functor in the ultrametric limit, but still Finsler, only pushed to the extreme λ → ∞;
Let U = U(v), V = U(w) be the addresses of two points. I define: LCP(U,V) = max k: Uₖ = Vₖ. The longest common prefix.
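The address and LCP machinery above can be sketched for the simplest case n = 2, i.e. directions on the circle S¹, where ϕₖ is just the k-th base-p digit of the normalized angle. A minimal sketch, assuming an arbitrary base P = 4 and prefix depth K = 12 (neither is fixed by the text):

```python
import math

P = 4          # digit base p (branching factor per level); an illustrative choice
K = 12         # finite prefix length, k < infinity

def digits(theta, p=P, k=K):
    """phi_1..phi_k for a direction on S^1: the base-p expansion of theta/2pi."""
    frac = (theta / (2 * math.pi)) % 1.0
    out = []
    for _ in range(k):
        frac *= p
        d = int(frac)          # which of the p sub-arcs the direction falls into
        out.append(d)
        frac -= d
    return out

def lcp(u, v):
    """Length of the longest common prefix of two addresses."""
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return n

def d_p(theta1, theta2, p=P, k=K):
    """Ultrametric distance p^{-LCP} between two directions."""
    return p ** -lcp(digits(theta1, p, k), digits(theta2, p, k))
```

With finite K this also realizes the limitation discussed later: every address names a cone of width ~p^{-K}, not a point, so even d_p(θ, θ) bottoms out at p^{-K} rather than 0.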
Metric: dₚ(v,w) = p^{-LCP(U,V)} It satisfies: dₚ(x,z) ≤ max {dₚ(x,y), dₚ(y,z)} i.e.: ultrametric. 4. Reconstruction (approximate inversion) //this can be wrapped, the problem of 2π rad on π rad and inversion disappears; Prefix of length k: Uₖ represents: a directional cone: Cₖ ⊂ S^{n-1} with angle: θₖ ∼ p^{-k} i.e.: reconstruction is a cone, not a point The limit: k → ∞ gives: a point. 5. Version without normalization (direction + radius) If: v ∈ ℝⁿ without normalization I encode: v → (r, v̂) where: r = ‖v‖ Then: U(v) = Uᵣ(r) + U_θ(v̂) i.e.: radial-directional address; //I will note that it is not a point but a boundary class; for now it is an equivalence class, but when I start using it, it will turn out I am on boundary classes; such an assumption has been in my notes from the beginning, but losing this context may be misleading, so I remind it from time to time; Interpretation: * radial prefix → distance * directional prefix → direction //big endian facilitates sorting; 6. Formal limitations (1) Infinity of digits An exact point requires: k → ∞ In practice: k < ∞ i.e.: I always have: a cone of non-zero width (2) Discontinuity of the inverse mapping U^{-1} is not a continuous function: it has jumps when the prefix changes.//we can of course use overlap on boundary classes which avoids jumps, but it will cause that for certain boundary conditions in certain projections the resolution will be exclusively statistical; which is reasonable and that is why we do triangulation from several projections, i.e. in measurement practice “from several detectors”; (3) Aliasing If: k is too small: many points: v map to: the same prefix. 7. Graph version i.e. the case n(n-1)/2;//in principle the only one worth attention; Rather important. Particle system Let: V = {v₁, …, vₙ} be a set of points. 
Between each pair (i,j) we define the direction: d̂_{ij} = (vⱼ - vᵢ) / ‖vⱼ - vᵢ‖. The number of such relations: n(n-1)/2.

Relational address
Each relation (i,j) receives U_{ij} = U(d̂_{ij}), i.e. a relational direction.

Interpretation: "black sky" //i.e. where further aggregation brings nothing;
For each point vᵢ, the set {U_{ij}}_{j ≠ i} is the projection of the sky seen from vᵢ //from the projection onto the graph; i.e. a local direction map.

8. Forced quantization through finite references is a consequence. I assume...
Let R = {r₁, …, rₘ} be a set of references. Each point v I describe by d̂(v, rₖ), i.e. the directions to the references.

Number of possible states
For a prefix of length k, the number of possible states is p^{k m}, i.e. finite.

Even for an infinite graph
If m < ∞, then the number of distinguishable states is finite, i.e. quantization is forced. Independent of the number of nodes.

9. Natural "fuzziness"
For a prefix of length k, the point is not a single one but Ωₖ, i.e. a fuzzy region. This is natural fuzziness. Without additional functions.

10. Formal applications
(A) Point rendering
The address U defines a cone of view. Natural: * LOD * depth sorting * no z-buffer
(B) Star map
Without normalization: (r, v̂) → address. Each star, one number. //each particle, each electron – you name what the graph and the render are of;
(C) Indexing of embeddings
Instead of cosine search I use LCP, i.e. a prefix tree... it works nicely because one can sort lexicographically, a no-brainer;
(D) Particle graphs
Each node has a local sky. Interactions via prefix match.

11. Boundary case, i.e. complete relations
If n → ∞ but the number of references m is constant, then each node is described by m directions. I.e. the global structure remains quantized.

12. The deepest implication is that the construction gives a geometric transition between the Euclidean metric and the ultrametric. Formally: lim_{k→∞} dₚ → d_angle in the sense of convergence of regions.

13.
Probably important Not only “direction by one number”. It is also a hierarchical representation of directional relations which: * works locally * scales globally * forces quantization through references * is naturally fuzzy 14. To sum up Formally: S^{n-1} ⟶ ℤₚ with the property: Uₖ ↔ directional cone and for the graph: n(n-1)/2 relations ⇒ local sky for each node as well as: m < ∞ ⇒ forced quantization Directional calculus on the graph, ultrametric and by definition fuzzy. ========== 02========== 1. Leibnizian assumption i.e. only relations exist, we play with a graph from a set; I do not assume: ℝⁿ nor any space. I assume: G = (V, E) where: * V = {v₁, v₂, …} i.e. a set of particles (nodes) * E are observable relations; //i.e. I look around with various metrics and try to figure out what is going on in the “world”/in the graph; Let: R_{ij} denote the relation between particles vᵢ, vⱼ. We do not interpret it as distance or direction — it is simply: R_{ij} ∈ ℛ an element of a certain set of relations. 2. Local perception i.e. “black sky” For each particle: //node vᵢ I define: Nᵢ = {R_{ij}}_{j ≠ i} i.e. the set of relations visible from node i. This is the local relational sky 3. Hierarchical encoding of relations I assume that there exists a procedure for distinguishing relations by levels: ϕₖ: ℛ → {0, …, p-1} Each level distinguishes relations more and more accurately. Relation address U_{ij} = ∑_{k=1}^K ϕₖ(R_{ij}) p^{k-1} i.e.: R_{ij} → U_{ij} ultrametric address of the relation. 4. Particle state as a composite address The state of a particle: vᵢ we define as: Uᵢ = (U_{i1}, U_{i2}, …, U_{in}) i.e. a vector of relational addresses. This is the full relational state. 5. Overlap of relations (“shared sky”) For two particles: vᵢ, vⱼ I define: S_{ij} = Nᵢ ∩ Nⱼ i.e. the set of common references. Interpretation of the measure of “relational closeness” Intuitively, the farther the relations, the smaller the common set formally: |S_{ij}| ↓ when: i, j are less related. 6. 
Binarity (lack of direct visibility)
If R_{ij} ∉ E, then the relation is indirect. Then U_{ij} is determined by a path in the graph: vᵢ → vₖ → ⋯ → vⱼ, i.e. indirect relations preserve the structure. This corresponds to the notion of "binarity": there are projections in which it is impossible to agree on a common relation with sufficient precision, which in many cases means with no precision at all.

7. Ultrametric as a consequence of world coherence
I assume, optimistically, that the world is stable: R_{ij} ≈ R_{ik} ⇒ R_{jk}, i.e. that relations are consistent. Or at least not strongly inconsistent. Because if I did not assume that, I would have nothing to calculate. From this it follows:
LCP(U_{ij}, U_{ik}) ≥ min { LCP(U_{ij}, U_{jk}), LCP(U_{jk}, U_{ik}) }
i.e. the ultrametric inequality. Not assumed, but derived. Though it may be a tautology of the assumptions; I have suspicions about such a bias.

8. Entanglement as sharing of relations
The state of the system (vᵢ, vⱼ) is entangled if U_{ij} does not decompose into Uᵢ ⊕ Uⱼ, i.e. the relation is not local. Formally: entanglement = non-decomposability of the relation. This is a rather natural graph interpretation.

9. Why quantization appears
This follows from one fact: |Nᵢ| < ∞, i.e. a finite number of references.
Theorem (relational quantization): if * the number of references is finite * the resolution is finite, then the number of distinguishable states is < ∞, i.e. quantization follows from referential limitations. Not from space.

10. Why continuity still appears
If |Nᵢ| ≫ 1, i.e. very many references, then the regions Ωₖ are very small. The limit |Nᵢ| → ∞ gives apparent continuity. And one can be fooled. I.e.: continuity = limit of many references.

11. Why noise is unavoidable
If for a certain relation there is a lack of references, S_{ij} ≈ ∅, then the address U_{ij} is poorly defined. This is relational noise. It cannot be removed. One can only increase the number of references. I.e.
the common problem we have in measurement and reconciliation of results. That is why we invented probabilistic models. Then it is easier to describe the phenomenon, but not necessarily to learn anything about it. For engineering purposes it is sufficient.

12. Version for qubits
The qubit state |ψ⟩ can be treated as a relation to the set of references {rₖ}. Address: U_ψ = (U_{ψ r₁}, …, U_{ψ rₘ}), i.e. a state as a set of directional relations. Interpretation: superposition = sharing of regions. Entanglement = sharing of addresses.

13. Why Hilbert is not necessary
Without denying its usefulness, one can say: the Hilbert space ℋ is the continuous limit |Nᵢ| → ∞. This construction works for |Nᵢ| < ∞, i.e. it is more "operational". Therefore easier for computational applications.

14. n-dimensional maps (including 3D)
For each particle I have Uᵢ = (U_{i1}, …, U_{im}), i.e. a local relational map.
Reconstruction of space
If the relations are globally consistent, //or consistent on a high-width locality, because of binarity; then one can recover the embedding V → ℝⁿ (e.g. via MDS). I.e. space emerges from relations. Not the other way around.

15. Why this simplifies calculations
Because instead of continuous vectors I have prefixes, i.e. the operations * LCP * prefix comparison * tree, instead of * matrices * projections * norms.

16. The most important consequence of boundary classes
If two addresses have a common prefix of length k, then they belong to the same class [U]_k, i.e. equivalence classes of directions. This is a natural definition of "blurring".

17. Intuition about star maps //because the motivation of this note was "can I describe a direction on an n-sphere with one number?", and then "is a direction also a region of space?"; i.e. a very good example.
Without normalization: v → (r, v̂), so the address encodes * direction * distance. This gives a natural star index.

18. What is really new here conceptually //probably...
Not that "ultrametric instead of space".
Only: relations → hierarchy → geometry, i.e. geometry is not primary. It is secondary.

19. The greatest practical potential
Not in fundamental physics, because there one immediately gets a gazillion postulates and ontologies, but in:
1. relational graphs
2. embeddings
3. many-body simulations
4. mapping of point spaces
5. point rendering
Because there the hierarchy of relations is natural. And when physics is pushed onto algebras where hierarchy is forced and the "hierarchy problem" disappears, then panic in the eyes, because the analytical tools start to look alien. Provided they exist at all in the absence of an exact inversion.

20. The construction formally corresponds to:
world = graph of relations
direction = ultrametric address
continuity = limit of many references
quantization = finiteness of references
And this exactly explains why one can simultaneously have * continuity * discreteness * blurring * stability of relations in one construction.

========== 03==========

1. The first fundamental gap: discontinuity of the quantization operator
The mapping has the form: v ⟶ U(v) = (d₁, d₂, …, dₖ), where dᵢ = ⌊fᵢ(v)⌋. This means U(v) is a discontinuous function, i.e. there exist points v ≈ w such that U(v) ≠ U(w). This is the classic quantization boundary problem.
The consequence: if the dynamics v(t) crosses the boundary fᵢ(v) = n, then dᵢ jumps by 1. //no matter how we define the units; this may be some epsilon or other infinitesimal at the given resolution; This gives a numerical impulse, not a physical one. This is the first candidate for an "artificial force". Though in projection it is impossible to distinguish where it came from, and one can infer some unidirectional "force" that has no inverse.

2. Numerical drift from rounding
I have the iteration: U_{t+1} = Q(F(U_t)), where Q is the quantization and F is the evolution of relations. If Q(x) ≠ x, then ε_t = Q(F(U_t)) - F(U_t). This is a systematic error. And I will immediately make it useful.
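The error ε_t above can be measured with a toy iteration before committing to any ontology. A minimal sketch, assuming a 1-D relation value, a drift-free random evolution F, and two illustrative quantizers Q (a biased floor and a near-unbiased round; all names here are my assumptions, not part of the construction):

```python
import random

def F(x):
    """Toy evolution of the relation value (no built-in drift)."""
    return x + random.uniform(-1.0, 1.0)

def Q_floor(x, step=1.0):
    """Biased quantizer: always rounds down to the cell boundary."""
    return step * int(x // step)

def Q_round(x, step=1.0):
    """Nearly unbiased quantizer: rounds to the nearest cell."""
    return step * round(x / step)

def mean_eps(Q, T=20000, seed=7):
    """Average epsilon_t = Q(F(U_t)) - F(U_t) over T iterations."""
    random.seed(seed)
    u, total = 0.0, 0.0
    for _ in range(T):
        f = F(u)
        q = Q(f)
        total += q - f      # epsilon_t, the per-step quantization error
        u = q
    return total / T
```

On this toy, the floor quantizer has E[ε] ≈ -step/2 while the rounding one has E[ε] ≈ 0, which is exactly the kind of bias the rest of this section is about.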
If E[ε_t] ≠ 0, then we get unidirectional drift, i.e. this is a very serious thing. It may give * pseudo-time * pseudo-gravity * directionality of the process, without a physical "reason". //the reason is the way of operating on projections;

3. Undecidability of the state (I predict)
I have a prefix of length k, U_k, but many states v₁, v₂, … such that U_k(v₁) = U_k(v₂). I.e. an equivalence class, not a singleton. Formally, the real state v is unknown. I know only [v]_k, i.e. Ω_k. This gives undecidability of the state, but not because of physics, only because of the encoding in projections. In the graph everything is as it should be. There is no magic there.

4. Effect of "coding entropy", where each rounding removes information
Let H(v) denote the information entropy about the state. After quantization: H(Q(v)) < H(v), i.e. a monotonic loss of information, which naturally gives an arrow of time without introducing time. It gets interesting. Because if we also use the graph as a recording structure (reasonable), then the conclusion we can draw from the evolution sounds like "it has always been like this, nothing special is happening, nothing to see here".

5. Instability of prefix boundaries
The boundaries of the Ω_k are unstable places. If v lies close to a boundary, then a small perturbation δv may change U(v) globally. This gives quantization chaos, i.e. fluctuations.

6. Problem of relational aliasing
If the number of references m is small, then many states have identical addresses. Formally: if |V| ≫ p^{k m}, then ∃ v ≠ w: U(v) = U(w), i.e. relational aliasing. This may give false identifications. Or indistinguishability despite a change of Esum as an invariant.

7. Problem of lack of inverse symmetry
For the rounding Q(x), usually Q(-x) ≠ -Q(x). If this occurs, the system receives a directional bias, i.e. a preferred direction. Which may look like a "force".

8. Is the drift reversible?
If Q is not invertible, then U_{t+1} → U_t does not exist uniquely. This gives time asymmetry, i.e. a natural arrow of time. This does not have to be physical, only purely numerical.

9. Extreme problem: accumulation of errors
Over the iterations t = 1 … T the error ε_t sums up: E_T = ∑_{t=1}^T ε_t. If E_T ∼ T, then linear drift. If E_T ∼ √T, then diffusive noise. This must be measured, not assumed. But one can also, after measurement, accept it and consider it a nice dynamical system. If it works, giving results similar to those expected, then it can be considered a good computational tool for the given application.

10. Where "pseudo-gravity" may arise
If the quantization error depends on ‖v‖, then large values will be systematically shifted. This gives radial drift, i.e. something looking like a potential. This is a very real threat. And very interesting consequences.

11. The greatest logical risk is not numerics
Only whether the relations really satisfy the ultrametric. Because if not, then the prefix tree breaks consistency. And then reconstruction is impossible. Which, in principle, has advantages at bifurcations between projections.

12. The most important correctness test of the model
It is worth checking whether, for the data v_i, the following holds: LCP(i,k) ≥ min(LCP(i,j), LCP(j,k)). If it often does not, then the model is inconsistent.

13. Very important: potential dynamic undecidability
If the state belongs to the class [v]_k and the evolution depends on details inside the class, then it cannot be computed. Formally: F([v]_k) ≠ [F(v)]_k. This gives undecidability of the evolution. This is a very serious gap. It requires semantic handling with state I, i.e. the undecidable state for the given projection mentioned in "HPF QCO tower horizons".

14. The greatest risk in practice is quantization bias
I.e. a non-zero mean error. If E[ε] ≠ 0, the system always drifts. This must be checked empirically on the data set.

15.
But at the same time, many computational tasks turn out simpler than with matrices. Because the operations * prefix compare * LCP * branch split are O(k), not O(n²). This gives huge computational gains.

16. Where to look for the first errors
The most dangerous:
1. quantization bias
2. relational aliasing
3. instability of prefix boundaries
Not ontology. Not topology. Only numerics.

17. Intuition: drift gives a unidirectional force
Which is quite sensible. Because if E[ε] > 0, then an effective potential arises without physics. From the encoding on projections alone. And possibly on the return to the graph as an aggregating backprop structure.

18. The brutal test that should be done
A simulation assuming:
1. random relations
2. evolution
3. repeated quantization
4. drift measurement
If drift ≠ 0, then we have an artificial force. This will be useful a few chapters later.

19. Summary: where the true gaps can be found at this moment
Not in the idea, but in the mechanics of discretization, i.e. in the implementation. Specifically:
* rounding bias
* class boundaries
* loss of information
* aliasing
These are things that may * generate time * generate direction * generate pseudo-forces, without the physics (i.e. without some planned computation model) that we would insert there. And exactly there one must look for the first cracks in the model. But these may also be its advantages. It depends on what one wants to achieve.

========== 04==========

Apparatus for selecting the locally optimal metric through the parameter λ.
1. Definitions
2. Admissible range of λ
3. Discriminability functionals
4. Criteria for choosing λ*
5. Effective arity
6. Point of maximum structural sensitivity
7. Ensemble λ (metric integral)
8. Operational procedure
9. List of useful formulas

1. Base structure
Let G=(V,E) be the coupling graph. Given d:V×V→ℝ₊, the base metric (it may be a pseudometric).
I define the family of projection metrics: wλ(u,v)=ϕ(d(u,v);λ) where typically: ϕ(d;λ)=e^{-λd} or: ϕ(d;λ)=1/(1+(λd)^p) (Lorentz/Cauchy class). 2. Local projection For node v: μλ(u|v)=wλ(u,v)/∑_{x∈V} wλ(x,v) This defines the local projection measure. 3. Admissible range of λ I do not use the boundaries 0,∞. Although they obviously exist from the conditions of the aggregation functor (0,∞); The range results from discriminability conditions. Lower bound λ_min(v)=inf{λ:∃u≠v, wλ(u,v)>ε} Interpretation: the first neighbor exceeds the visibility threshold. Upper bound λ_max(v)=sup{λ:∃u₁≠u₂, |μλ(u₁|v)−μλ(u₂|v)|>ε} Interpretation: the last value before degeneration to a single contribution. Range: λ∈[λ_min(v),λ_max(v)] 4. Discriminability as a functional Basic object: D(λ;v) is the measure of state separation. Several equivalent forms. (A) Local entropy H(λ;v)=−∑_u μλ(u|v)log μλ(u|v) (B) Inter-state variance Var(λ;v)=∑_u μλ(u|v)(d(u,v)−d̄)² (C) Fisher information with respect to λ I(λ;v)=∑_u μλ(u|v)(∂/∂λ log μλ(u|v))² So there is a very strong candidate. 5. Criterion for choosing λ* There exist several equivalent definitions. (I) Maximum discriminability λ*(v)=arg max_λ D(λ;v) where e.g.: D=I(λ) (Fisher-optimal). (II) Maximum entropy change λ*(v)=arg max_λ |dH/dλ| This is the point of the greatest structure reorganization. (III) Stable RG point dλ/dℓ=0 i.e.: β(λ*)=0 6. Effective arity Key constructive variable. k_eff(λ;v)=#{u:μλ(u|v)>ε} or: k_eff(λ;v)=e^{H(λ;v)} (entropy-equivalent arity). Arity criterion If I want to impose: k_target then: λ*(v)=arg min_λ |k_eff(λ;v)−k_target| 7. Point of maximum structural sensitivity Formally: λ_sens(v)=arg max_λ d/dλ k_eff(λ;v) Interpretation – maximum change in the number of active neighbors. Usually corresponds to: λd≈1 for the typical distance. Alternatively: λ_sens=arg max_λ I(λ) (Fisher peak). 8. Ensemble λ i.e. metric integral I consider: Z(v)=∫_{λ_min}^{λ_max} e^{-S(λ;v)} dλ where: S(λ)=α H(λ)+β R(λ) R — redundancy. 
Effective λ
λ_eff(v)=∫ λ e^{-S(λ)} dλ / ∫ e^{-S(λ)} dλ
Dominant points satisfy dS/dλ=0 (stationary phases).

9. Discriminability condition (why this works)
Basic condition: ∃u₁≠u₂: D_KL(μλ^{(1)}∥μλ^{(2)})>δ, i.e. the states are informationally discriminable. Without this the metric degenerates.

10. Operational procedure (local)
For each node v:
Step 1 – compute the distances d(u,v) for the local neighborhood.
Step 2 – determine the range of λ: λ_min ← first exceedance of ε; λ_max ← last separation.
Step 3 – compute the functionals on a grid of λ: * H(λ) * I(λ) * k_eff(λ)
Step 4 – look for: * the maximum of I(λ) * the maximum of |dH/dλ|
Step 5 – optionally, the ensemble integral λ_eff.

11. Useful formulas (working list)
Weight normalization: μλ(u|v)=e^{-λ d(u,v)} / ∑_x e^{-λ d(x,v)}
Entropy: H=−∑_u μ_u log μ_u
Fisher information: I=∑_u μ_u (∂_λ log μ_u)²
Effective arity: k_eff=e^H
Entropy gradient: dH/dλ=−∑_u (dμ_u/dλ)(1+log μ_u)
Stability condition: d²S/dλ²>0

12. The most important equation to compute in the case of an embedding:
λ*(v)=arg max_λ I(λ;v)
i.e. the maximum of Fisher information with respect to λ. This usually * gives a single stable point * corresponds to maximum sensitivity * agrees with Lorentz-like intuition.

13. At the end it is worth computing for the whole set (because we are processing some data): k_eff(λ) and dH/dλ. Check whether λ_sens≈λ_eff. If so, then it is strong evidence of the consistency of the construction and of its application to sensible data.

It looks silly, please do not irritate physicists with this:
| model            | physical analog              |
| λ                | RG scale                     |
| arity            | number of degrees of freedom |
| local projection | coarse-graining              |
| ultrametrics     | spin-glass state spaces      |
| integral over λ  | path integral over metrics   |

========== 05==========

Here I will consider the conditions of aggregation: why, what for, and when it may make no sense. Information issues.
But the reader will immediately notice that I am referring to physical analogs (needed for my imagination), i.e. to access to the exterior algebra and thus to degrees of freedom. This is not a formal part; this is the stage of reasoning about applicability.

1. Why there is any sense in seeking the aggregation optimum at all
I have the graph G=(V,E) and the local aggregation functor:
A_λ(v)=∑_u w_λ(v,u) x(u)
where:
* x(u) — data
* w_λ — kernel dependent on λ
* λ controls how many degrees of freedom I collect
I.e.: λ = regulator of the number of active neighbors.
The key observation is that aggregation is always a compromise: information vs noise. Formally: if I collect * too little → lack of context * too much → blurring of structure. Therefore there naturally exists (let us hope) a λ* which maximizes useful information (or rather information that allows discrimination, as a form of utility).

Informational motivation
The natural functional I(λ) is the mutual information between * the state of the node * its neighborhood. I.e.: λ* = arg max_λ I(v; N_λ(v)). This is probably the pure justification. Not physical (though in my head I had such an analogy when I committed this). It is informational.

Energetic motivation
If I treat S(λ) as a cost, S = redundancy + noise, then λ* = arg min S. I.e. what physical systems do by minimizing an energy functional.

Computational justification
If I compute k_eff(λ), the effective arity, then the computational cost is C(λ) ∼ k_eff(λ). I.e.: larger λ → smaller arity → cheaper computations. But: too large λ → loss of information. Again a compromise. Which is worth determining somehow.

2. Why the integral over λ is logical... well, because nothing fits here better; such is my intuition, but that is not a justification, only an impression. Of course I copied this from physical intuition. Because I do not know in advance which scale is the right one. In that case:
Z = ∫ e^{-S(λ)} dλ
means that I consider all aggregations at once.
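The local λ-scan behind this reasoning can be sketched directly from the formulas of 04 (μλ, H, k_eff = e^H). A minimal sketch on synthetic data; the exponential kernel, the λ grid, and the toy distances are all illustrative assumptions:

```python
import math
import random

def mu(lmbda, dists):
    """Local projection weights mu_lambda(u|v) = e^{-lambda d} / Z."""
    w = [math.exp(-lmbda * d) for d in dists]
    z = sum(w)
    return [x / z for x in w]

def entropy(p):
    """Local entropy H = -sum mu log mu."""
    return -sum(x * math.log(x) for x in p if x > 0)

def k_eff(lmbda, dists):
    """Entropy-equivalent effective arity k_eff = e^H."""
    return math.exp(entropy(mu(lmbda, dists)))

# Toy neighborhood: distances from one node to 20 synthetic neighbors.
random.seed(1)
dists = [random.uniform(0.1, 3.0) for _ in range(20)]

# Scan k_eff over a grid of lambda, as in Step 3 of the procedure.
scan = [(lm, k_eff(lm, dists)) for lm in [0.1, 0.5, 1.0, 2.0, 5.0]]
```

On such a scan, k_eff falls monotonically from ~n at small λ (everything aggregated) towards 1 at large λ (degeneration to a single contribution), which is exactly the bracketing of the admissible range [λ_min, λ_max].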
Informational interpretation of the integral
This is not any "sum over universes". Only a mixture of models, i.e. an average over many possible structures. Formally:
A_eff = ∫ A_λ P(λ) dλ
This has a not-stupid statistical justification. It works because different λ capture * different scales of structure * different levels of hierarchy. And the integral gives multi-scale aggregation.

3. Where the problem begins (why this may be stupid) and one has to be careful;
(A) Lack of a single optimum
If I(λ) has many maxima, then λ* is not unique. This means that the local geometry is multi-scale. The integral then blurs structures. And we get soup, with Pearson correlation oscillating around zero; there is no basis for discrimination. There is no discriminability. I.e. exactly what we have on the informational horizon.
(B) Dependence on the base metric
If d(u,v) is badly chosen, then no λ will help. The optimum exists only with respect to a sensible base metric. And one has to define sensibility for oneself. If our goal is to add noise to the data, then one can certainly allow this.
(C) Computational cost
The integral ∫ dλ may be expensive. In practice one does a sum over several points: ∑_i w_i A_λi. Which I already have in the implementation, and it is damn expensive: a few cores, 1.2 million tokens, two weeks already, a few days left.

4. Intuition about the number of degrees of freedom
The maximum available degree of freedom is limited. But the interpretation is informational, not physical (though I start from this analogy). Each node has limited bandwidth. Let d_v be the number of degrees of freedom of the node. Then the maximum information: I_max(v) ∼ d_v. I.e. there is no sense in collecting more information than I can encode. This is supposedly fundamental. But anyway, in the first pass I collect more, to find out how much is too much and cut it.

Excess aggregation = noise
If k_eff ≫ d_v, then I obtain over-aggregation, i.e. blurring of structure.
This resembles the situation when the maximum number of degrees of freedom limits the way of aggregation This is purely logical|informational, not physical. 5. Conclusion optimal arity follows from the number of degrees of freedom of the node Formally: k_eff* ≈ c · d_v where: c ∼ 1…3 Which appears in many systems: * signal encoding * neural graphs * data compression 6. Initial intuition about 3, 2, 1 degrees of freedom //I had in mind exterior algebra for 3color, electron, photon exactly in this order; I will write the general model without physics; Let: d_v be the number of independent directions of information. Then the maximum number of sensible neighbors: k_max ∼ O(d_v) Not: O(n) Probably important for interpretation If a node has: * 3 degrees of freedom → sensible local arity ~3–6 * 2 degrees → ~2–4 * 1 degree → ~1–2 Larger values do not add information. They add redundancy. I.e. one can, but does not have to. 7. Projection “itself gives the metric” If the graph has no space, the projection metric itself follows locally; Rather a sensible thesis. Formally: the metric may be: d(u,v) = f(overlap(N(u),N(v))) i.e. it follows from neighborhood relations. Not from space. This is the mechanism of: * graph embeddings * ultrametrics * reconstruction of geometry from topology 8. Locally something Riemannian comes out, If I have: * local smoothness * small changes of structure then the local metric: d(u,v) ≈ g_{ij} Δx^i Δx^j i.e. locally quadratic. This does not require global space. Only local consistency. And typically we construct data so that they are locally consistent. At least if they are data about the world. Creating such a locality tree the ultrametric will certainly work, Finsler will rather work, Riemannian may globally fall apart and be locally true. 9. The most important argument FOR seeking the optimum !physical, !aesthetic, only compressive. If: λ* maximizes: I(v; N(v)) then: I have the shortest description of the local structure. I.e. 
minimum description length. Good justification. 10. The most important argument AGAINST If the structure is fundamentally multi-scale then: one λ: is bad by definition. Then the integral is necessary. Not optional. 11. Important summaries This is not color physics. Only the limitation of the number of independent information channels. And this determines the optimal arity. 12. Very shortened version (essence) Why seek the λ optimum: * because the node has a finite number of degrees of freedom * excess data = noise * shortage = lack of information * the optimum maximizes discriminability Why the integral: * because the structure may be multi-scale * one λ is not enough Why this may be bad: * if the base metric is wrong * if the structure has no single dominant scale * if the computational cost kills the gain 13. If the intuition that the maximum number of degrees of freedom of the node determines aggregation is correct, then the most fundamental relation worth computing is: k_eff(λ) vs rank of the local covariance matrix If: k_eff* ∼ rank then I have very strong, formal confirmation that this at least locally works. And this was the first thing I checked from the p-adic selection, that the local neighborhood cosine similarity and euclid distance overlaps quite well (better than I expected) with the global selection in the ultrametric limit. The order of choice varies, but the consistency in the order of positions is very high (I was looking for 0.7, it is significantly above 0.8 for non-isolated nodes). ========== 06========== Tinkering with exponents because this leads to a very interesting conclusion in the context of available control over degrees of freedom. In this case degrees of freedom with respect to t. Of course only in one direction and in a completely standard way. So if someone wants to travel in time I recommend stepping into a wardrobe and returning five minutes later – the wardrobe definitely works in that direction :) 1. 
Consistent family of functionals (starting point) I assume: * graph G=(V,E) * base metric d(u,v) * scale parameter λ>0 * exponent p>0 1.1 Aggregation kernel Natural choice (my Lorentzian form): w_{λ,p}(r) = 1 / (1 + (r/λ)^p) This is: * local for small λ * global for large λ * controls effective arity 1.2 Aggregation functional For node v: A_{λ,p}(v) = ∑_{u∈V} w_{λ,p}(d(v,u)) x_u This is: the primary object of the entire construction. Everything else must follow from it. So using other aggregation metrics seemed somewhat suspicious to me, but one had to start somewhere. 2. Natural scale geometry i.e. where λ^{-1} comes from Key observation: λ is not an ordinary variable It is a scale parameter. And scale parameters naturally live in: log λ not in: λ 2.1 Change of scale variable I define: μ = log λ Then: ∂_μ = λ ∂_λ i.e.: ∂_λ = (1/λ) ∂_μ The key conclusion is that if I work in scale space then the natural differential norm contains: 1/λ This is not a heuristic only a change of variable. As the name itself indicates variables serve to change :) 3. Why the idea appears to replace the square by λ or 1/λ Such a “stupid idea”, but since it works it is not entirely stupid. It follows from scale geometry. 3.1 Classical Fisher Standard: I(λ) = E[(∂_λ log p)^2] This assumes: dλ² i.e. Euclidean metric on λ. 3.2 But λ is not linear If: λ → cλ then the change: Δλ has no absolute sense. What makes sense is: Δλ/λ i.e. relative change. 3.3 Natural scale metric Metric on scale space: ds² = dλ²/λ² i.e.: 1/λ² appears automatically. 3.4 Generalized scale Fisher Instead of: (∂_λ f)² I get: (1/λ ∂_λ f)² i.e.: (∂_μ f)² And this is Fisher in log-scale space. 3.5 Why this fits the discussed aggregation I have the kernel: (r/λ)^p depends on: r/λ i.e. on the quotient, not on the difference. This means that the scale is multiplicative, not additive. Therefore: 1/λ is natural. Not arbitrary. 4. 
Formal generalization of Fisher without the classical square I can define, because who forbids me: p-order scale Fisher I_p(λ) = E[ | (1/λ) ∂_λ log p |^p ] I.e. norm in log-scale space. Interpretation If: I_p has a maximum, then: λ* is the scale of greatest structural sensitivity. This is the RG-stable point. 5. RG flow appears naturally After the variable change: μ = log λ scale dynamics: dA/dμ = λ dA/dλ This is the RG flow equation. In any case it looks like it. Beta function Naturally: β(λ) = dλ/dμ i.e. the rate of scale change. The conclusion is that replacing: square → λ^{-1} is in practice: transition to RG formalism. Probably not a coincidence. 6. Saddle-point without the square, now formalization Standard version: S(λ) = ||A_λ − A_{λ+δ}||² But this again assumes Euclidean metric. And after all there is no metric in the graph by default (I pretend it is not appropriate to adopt it a priori). 6.1 Scale version Natural difference: Δ_μ A = λ ∂_λ A i.e.: ∂_μ A Action functional S(λ) = |λ ∂_λ A_λ|^p i.e.: norm in log-scale. Scale integral Z = ∫ e^{-S(λ)} (dλ/λ) Interesting that: dλ/λ is the natural scale measure. 7. The measure dλ/λ is proper because: if I change the unit: λ → cλ then: dλ/λ does not change. This is scale invariance. I.e. Haar measure on the multiplicative group. Formally: (ℝ⁺, ⋅) 8. Saddle-point in this version is the extremum condition: ∂_λ S(λ) = 0 i.e.: ∂_λ |λ ∂_λ A|^p = 0 Interpretation: the extremum appears when: λ ∂_λ A ≈ 0 i.e. aggregation becomes scale-stable. I.e. fixed RG point. 9. This is probably not a “stupid idea” Synthesizing; I commit the act: replacing the square by λ or 1/λ follows from three facts: (1) λ is a multiplicative scale Not linear. (2) Aggregation depends on r/λ Not on r−λ. (3) Natural geometry of scale is log-scale I.e.: dλ/λ not: dλ 10.
Consistent family of functionals already as the final form All this composes into one family (let us suspend disbelief): Kernel w_{λ,p}(r) = 1 / (1 + (r/λ)^p) Aggregation A_{λ,p}(v) = ∑_u w_{λ,p}(d(v,u)) x_u Scale Fisher I_p(λ) = E[ |(1/λ) ∂_λ log p|^p ] Scale action S_p(λ) = |λ ∂_λ A_λ|^p Integral Z = ∫ e^{-S_p(λ)} (dλ/λ) 11. The most important conclusion If I use: * the same scale λ * the same exponent p * the same measure dλ/λ then the entire construction remains in one consistent scale geometry. And then: * saddle point * Fisher maximum * RG point converge to the same class of stable points. I.e. I can motivate “tinkering with exponents”. Let us adopt provisionally that it holds. ========== 07========== Here I present the reasoning and formalisms about coexisting geometry if someone comes up with the idea of recording history in the nodes. And I immediately add the issue of Bell inequality assumptions at the end. So that we are on the same page with the consequences. Because Bell will break down on non-applicability. But only because it will not be possible to measure it since it will be difficult to determine a sufficiently good locality (compatibility of measurement structures in the region). If however one establishes locality from aggregation then Bell can be applied calmly, the cones (histories) are fine. 1. Node history as a hidden variable I assume: G=(V,E) interaction graph. Each node v has: h_v(t) the history of interactions up to time t. Formally: h_v(t) = (a_1, a_2, …, a_{k(v,t)}) where: * a_i are local events/interactions * the order preserves causality This is a chain of events. At least it looks like it. History aggregation Instead of storing the full chain I define the aggregate: H_v(t) = ∑_{i=1}^k a_i p^i This is the p-adic representation of history. It is not about the numbers themselves but about: * prefix encoding * hierarchy of event importance 2.
Metric on histories is ultrametric I define: d_p(u,v) = p^{-k(u,v)} where: k(u,v) = length of the common prefix i.e.: how many common events they have in the past. The key property is that the metric satisfies: d(u,v) ≤ max(d(u,w), d(w,v)) I.e. this is an ultrametric. Interpretation Closeness does not depend on spatial distance but on common history. This is the first moment where “spooky action” ceases to be spooky because the dependence is not spatial in the sense of the chosen projection. 3. Local history horizon i.e. why history can be short Observation: information becomes irrelevant beyond a certain range. Which can be formalized. Information decay function I assume the weight of history: w(i) = p^{-α i} where: * i is the depth into the past * α>0 Total tail influence: ∑_{i>N} p^{-α i} = p^{-α (N+1)} / (1 - p^{-α}) Definition of history horizon I choose: N_h such that: ∑_{i>N_h} p^{-α i} < ε This gives: N_h ≈ (1/α) log_p (1/ε) The conclusion is that history does not have to be infinite. It is effectively: N_h elements long. Which formalizes the intuition that history can be “reasonably short”. 4. Local interaction as common history prefix Important part. I assume that nodes u and v enter into interaction. After interaction: h_u = (h_0, a) h_v = (h_0, a) i.e. common prefix: h_0, a Entanglement as common history After this interaction: k(u,v) ≥ 1 i.e.: d_p(u,v) ≤ p^{-1} If no other interactions occur then the prefix remains. I.e. the dependence remains. If we do not change the state, spooky persists. 5. Local interaction changes the graph geometry I assume that the interaction: u ↔ v changes: * their history * their position in the relational graph Influence on a third node I take node w. The following then change: d(u,w) and: d(v,w) because the local configuration has changed. I.e. “local triangles of projection/triangulation”. Formally each interaction updates: {d(u,w)}_{w∈V} The key fact for distant nodes is that the change: Δd → 0 I.e. information decays.
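Sections 2–3 as a minimal sketch (function names are mine): the common-prefix ultrametric and the effective history horizon.

```python
import math

def lcp_len(h_u, h_v):
    # k(u,v): length of the common prefix of two event histories
    k = 0
    for a, b in zip(h_u, h_v):
        if a != b:
            break
        k += 1
    return k

def d_hist(h_u, h_v, p=2.0):
    # ultrametric on histories: d_p(u,v) = p^{-k(u,v)}
    return p ** (-lcp_len(h_u, h_v))

def horizon(alpha, eps, p=2.0):
    # N_h ~ (1/alpha) log_p(1/eps): depth beyond which history is negligible
    return math.ceil(math.log(1.0 / eps, p) / alpha)

x, y, z = [1, 1, 0], [1, 1, 1], [1, 0, 0]
```

The strong triangle inequality holds for any triple of histories; with α = 1, ε = 0.01, p = 2 the horizon comes out at 7 events, and the discarded tail ∑_{i>7} 2^{-i} is indeed below ε.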
And this is geometric locality. 6. Two metrics simultaneously I now have the graph metric d_G(u,v) dependent on connections. And the history metric d_p(u,v) dependent on prefixes. //but soon I will have only one, let us not anticipate facts and keep the tension; //twists in formal reasoning^^; detective novel about investigating graphs^^; Total metric i.e. the most natural: d_total = d_G ⊕ d_p e.g.: d_total = d_G + λ_p d_p Interpretation: dependences may be: * spatially local * but distant in the graph * and close historically This generates long-range correlations. 7. “Hidden variables” as local histories The idea that the hidden variable depends on the node’s horizon is quite precise. Definition of local hidden variable for node v: ξ_v = (h_v)_{≤ N_h(v)} i.e. history prefix up to the horizon. Different nodes have different horizons: each node has: N_h(v) dependent on: * number of interactions * local geometry * graph density I.e.: ξ_u ≠ ξ_v in general. 8. Hidden variables do not overlap globally. Because the sets of hidden variables overlap nowhere except at the moment of interaction. Formally for nodes u,v the history sets: H_u , H_v have intersection: H_u ∩ H_v = common prefix Over time: prefixes diverge. I.e.: k(u,v) ↓ Thus the global set of common variables: ⋂_v H_v is usually empty. But locally after interaction non-empty. This is the locality of correlations. 9. This gives the “spooky” effect Because the dependence is not spatial. Only historical. Formally: two nodes may have: d_G(u,v) ≫ 1 but: d_p(u,v) ≪ 1 I.e. far in the graph, close in history. This is pure geometry of the product of spaces. Not a paradox. 10. The most important consequence is deformable horizons Each node has a different range of its deformed horizon – this is an accurate input intuition. Definition of local horizon For node v: N_h(v) = (1/α_v) log_p (1/ε) where: α_v depends on local dynamics. I.e. local horizon, not global.
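The combined metric of section 6 in miniature (toy numbers, mine): far in the graph, close in history, and λ_p deciding which part wins.

```python
def d_total(d_G, d_hist, lam_p=1.0):
    # combined metric on the product space: spatial part + weighted history part
    return d_G + lam_p * d_hist

# u,v: far in the graph (d_G = 12) but sharing a 6-event history prefix (p = 2)
# u,w: graph neighbours (d_G = 1) with no common history at all (prefix length 0)
d_uv_hist = 2.0 ** -6
d_uw_hist = 2.0 ** 0
```

With λ_p small, spatial closeness wins (u is effectively closer to w); with λ_p large, historical closeness wins (u is effectively closer to v). This is exactly the long-range-correlation effect of the product geometry.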
Resulting geometry History space is inhomogeneous. Each node has its own memory scale. 11. Structural conclusion From the construction it follows that there exist (still, I will deal with this soon) two independent localities: 1. spatial 2. historical And two horizons: R_space(v) R_history(v) Their combination gives the effective local space. 12. Why this fits well with the earlier λ Because: λ controls: R_space and: N_h controls: R_history Natural consequence: aggregation should depend on both: w(r,k) = 1 / (1 + (r/λ)^p) ⋅ p^{-k} i.e.: simultaneously: * on distance * on history The most important conclusion from the whole so far. The intuition that hidden variables do not overlap globally, only locally at interaction is exactly what follows from the ultrametric geometry of history with local memory horizon. And this is the key point “spooky action” does not follow from the geometry of space but from the preserved common part of history i.e. limited by the local informational horizon. ====================== Assumptions of Bell inequality in the context of different sets of possible hidden variables for nodes at different horizons (discriminability of nodes); 1. Bell inequality assumes Standard construction where I have hidden variable: λ ∈ Λ key assumption: Λ is common for all measurements This is the most important. Not locality. Not realism. Common event space. Formally in the Bell model: A(a,λ) B(b,λ) where: * a — setting of measurement A * b — setting of measurement B * λ — common hidden variable Values: A,B ∈ {-1,+1} Key step in the proof One forms: A(a,λ) A(a′,λ) + A(a,λ) A(a′′,λ) which requires: A(a,λ) to be defined for all settings simultaneously. I.e.: λ must belong to the same space for all contexts. 2. If the sets do not overlap then what breaks I present the situation where I have local variables: ξ_v but: ξ_u ∈ Λ_u ξ_v ∈ Λ_v and: Λ_u ∩ Λ_v ≠ Λ i.e. lack of global domain. 
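The role of the common domain can be checked by brute force. A toy sketch (mine, not from the text): with one common λ, the values A(a,λ), A(a′,λ), B(b,λ), B(b′,λ) all exist simultaneously, so the whole deterministic hidden-variable space can be enumerated, and every assignment respects the bound 2.

```python
from itertools import product

def chsh(Aa, Aa2, Bb, Bb2):
    # |E(a,b) - E(a,b')| + |E(a',b) + E(a',b')| for one deterministic lambda
    return abs(Aa * Bb - Aa * Bb2) + abs(Aa2 * Bb + Aa2 * Bb2)

# a common lambda means all four response values are defined at once,
# so we can enumerate every deterministic assignment in {-1,+1}^4
bound = max(chsh(*s) for s in product([-1, 1], repeat=4))
```

The maximum is exactly 2, which is the content of the "key step in the proof": the bound comes from forming products of responses over the same λ.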
Operationally this means that there does not exist: λ such that: A(a,λ) and: A(a′,λ) are defined simultaneously. I.e. I cannot write: A(a,λ) A(a′,λ) This destroys the core of the Bell proof. It is simply not applicable because there is no global hidden variable which I will show immediately. 3. How this connects with the presented history construction I have local histories: h_v with horizon: N_h(v) I.e.: hidden variable: ξ_v = (h_v)_{≤ N_h(v)} But: for another node: ξ_u = (h_u)_{≤ N_h(u)} Their spaces: Λ_v Λ_u differ. After interaction I have common prefix: h_0 i.e.: Λ_u ∩ Λ_v ≠ ∅ but only locally. Over time they diverge again. This means lack of global common hidden variable. 4. What then happens with Bell inequality Then the classical inequality: |E(a,b) − E(a,b′)| + |E(a′,b) + E(a′,b′)| ≤ 2 does not have to hold. Because its proof requires: λ ∈ Λ common for all measurements. If you have: Λ_ab ≠ Λ_ab′ then I cannot carry out the summation in the proof. Because I do not have a common measure. It does not add up. This does not break logic. But it changes the probabilistic model.//There will be a plot twist on what kind, a few chapters later; for now I know that not always such. 5. I.e. contextuality Formally the results depend on: λ_c where: c is the measurement context. I.e.: A(a,λ_a) B(b,λ_b) without common: λ Then Bell inequalities do not follow. 6. Key condition i.e. what really decides Not the fact itself that the sets do not overlap but whether there exists: Λ_global such that: Λ_ab ⊂ Λ_global for all configurations. If so then Bell works. And if not then Bell does not have to work. But no one forbids it. It is simply not required. 7. In such a construction with local horizons I have local histories: h_v with different: N_h(v) I.e. hidden variables: ξ_v live in: Λ_v which change over time. This means lack of: Λ_global stable for all measurements. And this is the condition that undercuts the Bell proof. 
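For contrast, a numeric check (a toy of mine, not from the text): pairwise correlations that are each perfectly well defined per context, but that no single common λ table reproduces at once, can exceed the bound. I use the cosine correlations E(a,b) = −cos(a−b) at the standard settings purely as an example of such a contextual family.

```python
import math

def E(a, b):
    # per-context correlation; defined pairwise, no common lambda table assumed
    return -math.cos(a - b)

a, a2 = 0.0, math.pi / 2
b, b2 = math.pi / 4, 3 * math.pi / 4
S = abs(E(a, b) - E(a, b2)) + abs(E(a2, b) + E(a2, b2))
```

S comes out as 2√2 ≈ 2.83 > 2: the inequality fails not because logic breaks, but because the summation step has no common measure to run over.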
Not that I did it maliciously only it turned out this way. The Bell proof was for an optimistic, absolute set. Just like once absolute time, absolute space. Typical. 8. Where physical intuition appears here (without inventing) This corresponds to the fact that each node has its own deformed horizon. Relativity in the graph. I.e. each has its own: Λ_v After interaction they partially overlap. Over time they diverge again. This generates correlations without a common global hidden variable. 9. Conclusion If: Λ_u ∩ Λ_v is only local, not global, then Bell inequalities do not have to hold despite the existence of hidden variables. But this requires one thing... 10. Critical condition One has to break the global homogeneity of the event space. Formally this is the lack of one: (Ω, F, P) for all experiments. This fits exactly the model: * local histories * local horizons * lack of global synchronization of variables Conclusion If the construction actually leads to the situation in which: * each local history defines its own space of hidden variables * these spaces overlap only locally * there does not exist one global space of all possible histories then the Bell inequality may be violated, even though hidden variables exist, because the assumption of a common global domain of the hidden variable is broken, not locality itself. The problem is not whether the inequality is preserved, but whether it can be measured. Thus: locally (in a region with well-defined causal structure) Bell inequalities are sensible and well-defined; globally (on complex geometries or on disconnected state spaces) there may not exist one common probabilistic space; ========== 08========== Another stupid idea, immediately with an implementation idea for calculating something that looks like inertia and is very computationally efficient; context of aggregation functors from my other ideas; 1. 
Insert instead of modification in the calculation is a “paradigm shift” //let us say that this is an idea for a faster calculator; without spinning hype; Classical motion calculation: v_{n+1} = v_n + a_n Δt i.e.: I modify the existing state This causes: * accumulation of errors * necessity of calculating forces * necessity of integration This model: H_{n+1} = insert(p_n, H_n) i.e. I add an event, I do not change the state //yet; The current state only afterwards: x_n = A(H_n) where: A is the aggregator. This is fundamentally closer to: * recursive filters * memory systems with damping * signal models than classical mechanics. 2. This aggregator has a recursive form From intuition: I raise the previous result to the metric exponent (carry sign) and add the new term this looks like: S_n = λ S_{n-1} + p_n where: * p_n — insert (impulse) * λ — damping / metric * S_n — current state This is exactly exponential memory of history. Expansion: S_n = p_n + λ p_{n-1} + λ² p_{n-2} + ⋯ This is formally convolution with an exponential kernel. 3. “Insert zero” = natural inertia The idea is that.. minimum variability = insert zero; //or relatively neutral multiplier, depends on what algebra construction we have, because this chapter is implementational and may mostly concern the reader who may not go far beyond constructions based on R; In this construction if: p_n = 0 then: S_n = λ S_{n-1} i.e. the state maintains direction, only decays. This gives: * stability * natural continuation of motion * no necessity of calculating forces This is a very clean interpretation that inertia = absence of new inserts. Not an equation of motion. Absence of events. 4. 
Why this is computationally cheap the current vector v is a pain in the ass //experiences from implementation, here a list of swear words, to this day I remember those lines of code; because classically: * one has to calculate forces * then integration * then stabilization And here: * insert → O(1) * aggregation → O(1) Not: O(N history) because I have recursion. This is a first-order IIR filter. 5. Lower-triangularity appears naturally If I write history as a vector: H = (p_n, p_{n-1}, p_{n-2}, ...) and the aggregator as: S = T H where: T = ⎡ 1 0 0 0 ⋯ ⎤ ⎢ λ 1 0 0 ⋯ ⎥ ⎢ λ² λ 1 0 ⋯ ⎥ ⎢ λ³ λ² λ 1 ⋯ ⎥ ⎣ ⋮ ⋮ ⋮ ⋮ ⋱ ⎦ then I have a lower-triangular operator. This is the algebraic form of causality in the construction. 6. History cutoff by epsilon Concept: “until we reach the point where the next variable falls under the lower-triangular epsilon”; Formally: if: λ^k < ε then: p_{n-k} can be removed. This gives a finite memory horizon k = log_λ ε (i.e. k = log ε / log λ) This is a very important number – maximum memory length of the system. 7. This works even without p-adics, I can immediately check it on a 3D engine p-adics give: * convenient cuts * natural hierarchy * equivalence classes But the structure itself: insert + aggregator is general. It can be done: * on real numbers * on vectors * on graphs Because: classical pipeline: //see above where I cursed at implementations of this; * forces * integrator * stabilization This pipeline: 1. impulse insert 2. history filtering 3. position readout This resembles motion smoothing + impulse memory, but generalized and moreover elegantly. It has one huge advantage: determinism and stability. So I have a hidden AR(1) model //hidden depending on who, I put it there myself; In the construction this is: S_n = λ S_{n-1} + p_n i.e. autoregression of order 1 But on direction vectors. This is a known structure in: * signal theory * filters * discrete dynamics I.e. it works stably. I will immediately note a plot twist. 
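Sections 2–6 as a scalar sketch (names mine; on vectors the same recursion runs componentwise): the insert-only update, inertia as "insert zero", and the ε-cutoff memory horizon.

```python
import math

def insert_step(S_prev, p_n, lam=0.9):
    # S_n = lam * S_{n-1} + p_n : add an event, never rewrite the state
    return lam * S_prev + p_n

def memory_horizon(lam, eps):
    # smallest k with lam**k < eps : maximum memory length of the system
    return math.ceil(math.log(eps) / math.log(lam))

def replay(impulses, lam=0.9):
    # fold a whole impulse history through the AR(1) recursion
    S = 0.0
    for p_n in impulses:
        S = insert_step(S, p_n, lam)
    return S
```

Inserting zeros after a single impulse just decays the state geometrically (inertia without force computation); the recursion reproduces the explicit expansion S_n = p_n + λ p_{n-1} + λ² p_{n-2} + …; and with λ = 0.5, ε = 0.01 the horizon is 7 terms, with the discarded tail bounded by ε/(1−λ).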
The mass-velocity calculation from a very old, basic QCO construction where m² + v² = 1 always (norm |1|) extended by such a momentum update allows recording already weighted impulses in the update loop. Then separation of mass, velocity and momentum ceases to be essential. In principle we operate on a vectorial energy invariant, directed because Finsler. I add this only informationally so as not to complicate the implementation where extracting components backwards may be non-intuitive because projective. ========== 09========== 1. Space and local structure Let: G=(V,E) be a directed graph (or complex of local relations). Each vertex: v∈V possesses a local state A(v,t)∈ℝ≥0 interpretation: local accumulator of history. Such a “memory field” i.e. A₁ (from earlier constructions). 2. Carriers as fuzzy edges Each edge: e=(u→v) possesses a transmission weight μ_e(t)∈[0,1] interpretation: degree of activity of the relation where the fuzzy edge is an “impulse carrier”; The key point is that the carrier has no history of its own and propagates the relation. Formally: μ_e(t+Δt)=Φ(μ_e(t),A(u,t)) i.e. Markovian propagation with memory in the vertices. 3. Local aggregation (core of the model) For vertex v: N(v)={u:(u→v)∈E} I define: R_w(y₁,…,y_k)=(∑_{i=1}^k α_i(w) |y_i|^w )^{1/w} where: y_i=μ_{u_i v}(t) A(u_i,t) i.e. impulses weighted by edge activity. 4. Directional metric through w Key point of the construction: w=w(v,t) is not constant. Definition: w(v,t)=∫_{Ω_v} λ(ξ,t) dξ where: λ is the local metric density Ω_v is the directional cone / local neighborhood; Interpretation: the metric is a function of history and direction. 5. Dynamics of the history accumulator Update: A(v,t+Δt)=R_{w(v,t)}(μ_{u₁v}A(u₁,t),…) This is a Minkowski filter with variable exponent. Special cases: w→1 sum (diffusion) w→2 energy w→∞ max (wave propagation) I.e. local physics depends on w. 6. 
Filtering of impulse history We define the history of impulses: H_v(t)={A(v,τ):τ≤t} Filtering: F_t=σ(H_v(t)) i.e. the growing sigma-algebra of history. This gives local probability theory. 7. Local probability space For each vertex: (Ω_v, F_t, P_v) with: P_v(A)= A(v,t) / ∑_{u∈N(v)} A(u,t) I.e. local probability always exists, global not necessarily. This exactly matches the intuition that globally the Bell inequality may not hold, but locally it must. And the question remains what “locally” means. Because histories can also be close. 8. Foliation as a choice of resolution We define: ϵ(v,t) as the minimal aggregation scale. Foliation: Σ_ϵ={v: w(v,t)>ϵ} Interpretation: local unification of measurements. This exactly corresponds to “the range where we can develop a foliation”. So as not to get lost why globality is problematic while locality can be handled. 9. Carrier as a fuzzy edge without history Formally for the edge: e=(u→v) propagation: μ_e(t+Δt)=f(μ_e(t)) without a history component. Whereas: A(v,t) accumulates history. I.e. “photon” = relation, not entity. Such a t-skew carrier between nodes. But I started from a physical analogy. 10. Clock tick as the reciprocal of propagation velocity We define local velocity: c(v,t)=1/w(v,t) Interpretation: large w : strong contraction, slow propagation small w : fast propagation Clock tick: Δτ(v)=1/c(v,t)=w(v,t) i.e.: proper time = local metric exponent. Which is quite consistent with the intuition that the “watch” is the reciprocal of “velocity”. 11. Mass as inertia of filtering Definition: m(v)=d w(v,t)/dt Interpretation: mass = resistance to change of metric.//I will solve the paradoxes that immediately come to mind of the Mach, de Sitter and company kind; I.e. mass is not a parameter but an effect of history. 12. Contraction of the aggregator (key to stability) If: w(v,t)>1 then: R_w is a contraction in the L_w norm. Then: A(v,t)→stable point i.e. lower-triangular dynamics damps itself. 
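The Minkowski filter of section 5 and the local probability space of section 7, as a sketch (uniform α_i and the toy accumulator values are mine):

```python
import numpy as np

def R_w(y, w, alpha=None):
    # R_w(y) = (sum_i alpha_i |y_i|^w)^(1/w): Minkowski filter, variable exponent
    y = np.abs(np.asarray(y, dtype=float))
    if alpha is None:
        alpha = np.ones_like(y)
    return float((alpha * y ** w).sum() ** (1.0 / w))

def local_prob(A, neigh):
    # (Omega_v, F_t, P_v): accumulator values normalized over one neighbourhood
    tot = sum(A[u] for u in neigh)
    return {u: A[u] / tot for u in neigh}

y = [3.0, 4.0]
A = {"v": 2.0, "u1": 1.0, "u2": 1.0}
P1 = local_prob(A, ["v", "u1", "u2"])   # one local context
P2 = local_prob(A, ["v", "u1"])         # an overlapping context
```

The special cases come out as stated: w→1 gives the plain sum (diffusion), w = 2 the energy norm, large w approaches the max (wave propagation). P1 and P2 are each proper probability measures yet disagree about node "v", so the measure really is local, not global.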
Exactly as I suspected that “lower-triangular operations will decay under epsilon”. 13. Physical interpretation (of the construction directly) I now have: object → vertex (local observer) edge → fuzzy propagation relation A(v) → history w(v) → directional metric R_w → physical dynamics foliation → choice of resolution tick → local time scale mass → inertia of history This closes the whole apparatus. 14. The most important consequence (and potential questions); This formalizes “history exists for the observer, not for the photon” because: history resides in A(v) the edge carries only the relation. Not an entity. //I started from a physical analogy so as not to get lost in the reasoning; 15. Important observation that... This model naturally has: Markov locally but non-Markov globally This exactly corresponds to the fact that “we change the statistical model and we already know to which one”. I.e. Markov with effective memory through filtering. 16. Curiosities; I have a very natural definition of a worldline not as a particle trajectory, but as the maximal sequence of active edges in the filtration. This solves the problem that had been bothering me earlier: “did the rays intersect or not?”. It depends on whether the edge sequences have a common history prefix. So each can have its own. 17. What else follows from this... If: w(v,t) is locally determinable from history then the metric is not given, it is estimated. I.e. I do not have one projection of spacetime. I have a field of local metrics. If one were to use physical descriptions. This is quite close to what I intuitively sketched earlier with projections and change of center. ========== 10========== We return to embeddings, physics was only for the purpose of reasoning whether I approach the data and processing sensibly. 1. Embedding → dynamical system (key change of perspective) So far the embedding was: A(v) i.e. a static accumulator of history. 
And now a natural reinterpretation: A(v,t) is the dynamical state of the model. I.e. you have: A_{t+1} = T_w(A_t) where: T_w is the aggregation operator on the graph. This is recursive dynamics on the graph with adaptive metric. Not classical RNN. And not GNN in standard form. This is a metric history filter. 2. What training means in this model Classical NN train weights. And here the natural training parameters are: (A) local metric w(v,t) (B) relation weights μ_e (C) weight function α_i(w) I.e. training = shaping the metric and relations, not the activation function. This is a rather fundamental difference. 3. Gradient exists but in a different space For the aggregator: R_w(y) = (∑_i α_i(w) |y_i|^w )^{1/w} gradient with respect to y_i: ∂R_w/∂y_i = α_i(w) |y_i|^{w-1} R_w^{1-w} sgn(y_i) Key point – the gradient exists for any w>0. I.e. backprop is possible without hacks. 4. Gradient with respect to the metric w This is the proper “control of geometry”. ∂R_w/∂w contains: ∑_i α_i(w) |y_i|^w log |y_i| I.e. the logarithm of history influences the geometry. This exactly matches the earlier idea about logarithmic distribution of variance not by accident. This is a natural component of the gradient. 5. Training stability (contraction does the job) Because: w>1 ⇒ contractive operator then: ||A_{t+1} - A_t|| → 0 i.e. the gradient does not explode. This is a huge advantage over: * RNN * Transformers where gradient explosion is a classic problem. Here damping is built in structurally. 6. Interpretation of “ticks” as simulation step If: Δτ = w(v,t) then training time is nonlinear locally. I.e. different parts of the graph learn at different speeds. I.e. they can be trained when needed. This need may arise from the embedding exploration itself. This gives a natural mechanism of local adaptation of learning rate. Without manual setting. 7. Loss as history mismatch The most natural functional: L = ∑_v d(A(v,t), A*(v)) where: A* is the target state. 
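The closed-form gradient of section 3 can be verified against central finite differences; a sketch with α_i = 1 and a toy input of mine. That it checks out for any w > 0 is exactly the "backprop without hacks" claim.

```python
import numpy as np

def R_w(y, w):
    return float((np.abs(y) ** w).sum() ** (1.0 / w))

def grad_y(y, w):
    # dR/dy_i = |y_i|^{w-1} * R^{1-w} * sgn(y_i)   (alpha_i = 1)
    R = R_w(y, w)
    return np.abs(y) ** (w - 1) * R ** (1 - w) * np.sign(y)

y = np.array([0.5, -1.5, 2.0])
w = 3.0
g = grad_y(y, w)

# central finite-difference check of the analytic formula
h = 1e-6
fd = np.array([(R_w(y + h * e, w) - R_w(y - h * e, w)) / (2 * h)
               for e in np.eye(len(y))])
```

The analytic and numeric gradients agree to roughly the finite-difference accuracy, and the sign factor behaves as expected on the negative component.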
But more importantly I can use: d_w(x,y) = |x-y|^w i.e. loss metric dependent on w. This gives dynamic error sensitivity. 8. Very important observation that learning ≈ mass transport The model naturally realizes transport of history over the graph. Which is close to: * optimal transport * diffusion models but with directional metric. This is nontrivial. 9. What this means for embeddings Embedding ceases to be a static vector. It becomes: A(v,t→∞) i.e. the equilibrium state of the dynamics. This is quite a strong interpretation. Because embedding = fixed point of the system. 10. And what about fuzzy edges Training updates: μ_e on the basis of history consistency. The simplest rule: μ_e ← μ_e + η A(u) A(v) i.e. Hebbian update. But in the model it is metrically weighted. 11. The most important practical consequence You have a model that does not need: * softmax * sigmoid * tanh because the aggregator already does normalization. This is consistent with what I mentioned earlier about QCO as a damping mechanism instead of softmax. 12. What would be most useful here Not further proofs. Only specification of the minimal training algorithm. I.e. iteration for each v: 1. collect neighbors 2. compute: A(v) ← R_w(μ_e A(u)) 3. compute loss 4. update: * w(v) * μ_e 13. What is really new here Not the aggregator. Not the graph. Only the metric as a training parameter This is not a standard GNN. Only learning the geometry of the representation space. Not only weights. Which in principle models do emergently anyway. 14. Practical potential... Not in physics. But in dynamic embeddings, especially: * semantic * temporal * hierarchical Because such history filtering naturally handles: * context * time * memory without an artificial window. ========== 11========== And now I will try to blow up the foundation of the graph construction and instead of falling apart it will turn out to be stronger. 1. 
Starting point: there are many paths between two nodes I have two nodes: u, v and the set of paths: Π(u,v) where: |Π(u,v)| ≥ 1 Each path: π = (e₁, …, eₖ) has: * local fuzzy edge weights * local metrics w i.e. each path has its own cost / metric length. 2. Aggregation over paths (key mechanism) I do not choose one path immediately. I do: Φ(u,v) = (∑_{π∈Π(u,v)} β_π |ℒ(π)|ʷ )^{1/w} where: ℒ(π) is the path length: ℒ(π) = ∑_{e∈π} λ_e i.e. integral over possible solutions; 3. Then from the computational side it happens that... For: w > 1 damping of worse paths occurs. More precisely, if: ℒ(π₁) < ℒ(π₂) then: |ℒ(π₂)|ʷ / |ℒ(π₁)|ʷ → ∞ when: w → ∞ I.e. minimal paths dominate. So a variant of the principle of least action. 4. And if there is more than one path does the integral narrow the number of solutions? Formally this aggregation over paths acts as: Π(u,v) → Π_effective(u,v) where: |Π_effective| ≪ |Π| i.e. the graph itself simplifies. 5. Does a weakly defined graph stabilize? Only under the condition of contraction. If: * the aggregator is contractive * the metric is positive * the weights are bounded //i.e. no bifurcation has occurred; then: G_t → G* i.e. the graph converges to a stable structure. And at most it has a rather tense curvature. Because a large number of competing significantly divergent paths ⇒ large effective curvature. 6. Projections as a source of alternative paths I have different projections: P_i Each gives a local graph: G_i with a different relax-mode. I.e.: Π(u,v) = ⋃_i Π_i(u,v) i.e. projections generate alternative trajectories. 7. Lack of global structure → integration over projections If there is no single global projection, then I do: Φ(u,v) = ∫ Φ_i(u,v) dμ(i) i.e. average over projections. In the absence of a global one the integral over them will give one path average + contraction ⇒ stabilization. 8.
Return of the path to the graph (very important mechanism) After aggregation I define: π* = arg min_π ℒ(π) and I add: e* = (u → v) with weight: λ* = ℒ(π*) I.e. the path returns as a direct relation. This is graph renormalization. 9. The same principles apply as for λ in the metrics; And this is very strict. I have local: λ_e and pathwise: Λ_π = ∑_{e∈π} λ_e After aggregation there arises: λ*_{uv} = R_w(Λ_π) i.e. exactly the same algebra as for projection metrics. Not analogous, it is the same operation. 10. Consequence If for each pair of nodes I perform: aggregation over paths, then in the limit I will obtain: n(n−1)/2 effective connections. I.e. a complete graph, but metrically compressed. Not by manual connections. But by the integral over paths. 11. Interpretation of curvature of a “tense graph” Local curvature: K(u,v) = Var{ℒ(π)} i.e. large number of competing paths: ⇒ large variance ⇒ large curvature. 12. Computational practice I have an algorithm for each pair of nodes: 1. generate paths 2. compute lengths 3. aggregate via R_w 4. return the minimal one as the effective edge I.e. the graph closes itself. 13. It is interesting that the construction implies: redundancy of paths ⇒ stability of geometry And not the other way around. This is a very interesting property. 14. From the technical side it is worth paying attention to one thing... Whether the number of paths that you actually take into account is limited by the local cutoff ε? Because if: Λ_π > ε ⇒ I discard the path, then the whole mechanism becomes local and scalable. And then this apparatus becomes implementable in a sensible way. ========== 12========== 1. “Large cluster determines the order” through the mass of paths In aggregation over paths: Φ(u,v) = R_w(ℒ(π)) a large cluster generates: * many paths * with similar lengths * with low variance I.e.: Var{ℒ(π)} ↓ As a result one structure begins to dominate. This looks like “cluster gravity”, but formally it is entropic dominance of trajectories. 
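Sections 8–12 above as one sketch (toy graph, β_π = 1, w = 4 and the function names are my assumptions): aggregate path lengths, write the minimal path back as a direct edge, and read curvature as the variance of competing lengths.

```python
def path_len(path, lam):
    # L(pi) = sum of edge metrics lam_e along the path
    return sum(lam[e] for e in path)

def renormalize(u, v, paths, lam, w=4.0):
    lengths = [path_len(p, lam) for p in paths]
    phi = sum(L ** w for L in lengths) ** (1.0 / w)        # aggregation over paths
    lam[(u, v)] = min(lengths)                             # path returns as an edge
    m = sum(lengths) / len(lengths)
    K = sum((L - m) ** 2 for L in lengths) / len(lengths)  # K(u,v) = Var{L(pi)}
    return phi, K

# toy: two competing paths u -> v, lengths 2 and 5
lam = {("u", "a"): 1.0, ("a", "v"): 1.0, ("u", "b"): 2.0, ("b", "v"): 3.0}
paths = [[("u", "a"), ("a", "v")], [("u", "b"), ("b", "v")]]
phi, K = renormalize("u", "v", paths, lam)
```

The two paths give an effective edge λ* = 2 and curvature K = Var{2,5} = 2.25; many near-equal paths would drive K toward 0, which is the stable, low-curvature ("dense cluster") case.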
So more consistent paths ⇒ greater stability of the relation. Not force, but statistics. Even though we have a globally underdetermined statistical model. Locally it exists. 2. Small number of relations gives instability If: |Π(u,v)| is small then: * single paths have large influence * variance of lengths is large i.e.: K(u,v) ∼ Var{ℒ(π)} large i.e. local curvature grows. And this means instability of local geometry. So a small number of relations leads to instability. 3. Self-stabilization through local size, not global Here one should remain vigilant. It is not about: |V| → ∞ but about: |N(v)| large i.e. local degree of the graph. Stability results from local density, not from the number of all nodes. I.e. large graph, but sparse → unstable small graph, but locally dense → stable Distinguish and be careful. 4. The most stable is a locally dense graph with cutoff; I.e.: * globally infinite * locally finite Formally for each node: |N(v)| < ∞ but: |V| = ∞ This is exactly the structure of a locally finite network. And this is the most natural case for the model. 5. Role of the local cutoff ε, essential; The local cutoff by epsilon is obvious for lower-triangular algebras and this is important because it guarantees: ∑_π |ℒ(π)|ʷ < ∞ I.e. the integral over paths converges. Without cutoff an infinite graph: → the sum may diverge → lack of stabilization. With cutoff an infinite graph: → stable. This is a critical condition. 6. Intuition about “cluster gravity”; Formally one can write it as a local potential: U(v) = ∑_{u∈N(v)} A(u) Large cluster: U(v) ↑ i.e.: larger number of minimal trajectories. Effect: paths “fall into” the cluster. This is mathematically a field of attraction of trajectories. Not a metaphor, but a real statistical effect. 7. Very interesting consequence: stability grows like log(N) For large clusters the number of paths: |Π| ∼ e^{αN} but after aggregation the minimal trajectories dominate, i.e. 
effective variability: ∼ log |Π| So the larger the cluster, the more stable the local order. Not less. 8. Infinite graph is natural, but not because of size; Not because it is infinite. Only because it eliminates boundary effects. Because a finite graph: → has boundaries → generates asymmetries → introduces artificial curvatures Infinite graph: → local homogeneity This is a good reason. 9. This leads to a conclusion about training Large data clusters: → stable representations Small clusters: → unstable representations I.e. the model itself prefers areas with large data support. This is a very practical property. 10. The matter that decides everything: How exactly to define the cutoff ε? Whether: λ_e < ε ⇒ I remove the edge //if bifurcation; or: ℒ(π) > ε ⇒ I ignore the path //suggested; Because this changes: * the curvature of the graph * the pace of stabilization * convergence of the integral over paths And in implementation this is a structural parameter. ===END===