← Back to trust

Bittensor

Bit Off

Judgment, value, and delegated authority (Bittensor)

Contents

Abstract

Bittensor is described in its foundational materials and by subsequent commentary as a market for intelligence. The deployed protocol architecture defines a narrower institutional object. The subtensor runtime does not formalize intelligence, usefulness, or quality; it formalizes stake-weighted authority over judgments that originate in subnet-specific validator practice. Validators produce off-chain evaluations through subnet-specific scoring procedures, submit weight vectors to the runtime, and receive dividends through mechanisms in which stake appears at every decisive step of aggregation, consensus, rank generation, and reward distribution.

Two mechanisms organize the analysis. Capitalized judgment names the architectural posture: the protocol delegates evaluation to residual validator practice while formalizing capital-weighted authority over whose judgments count, how much they count, and how rewards are distributed. Signal-capital substitution names the candidate failure mode in which capital-weighted authority is weakly calibrated to evaluative competence, so the protocol rewards capital-positioned judgment more than well-grounded evaluation.

The analysis draws on construct-validity theory (Cronbach and Meehl 1955; Adcock and Collier 2001), commensuration research (Espeland and Stevens 1998; Porter 1995), and classification theory (Bowker and Star 1999), then reads the subtensor runtime at pinned commit references most recently refreshed May 4, 2026, BIT governance proposals (BIT-0002, BIT-0004), Lui and Sun’s (2025) system-wide empirical study, and a bounded packet trace of the historical OpenValidators mirror for SN1 text-prompting. The result is a map of the residual-judgment layer where evaluation occurs, the runtime mechanisms that convert those judgments into authority-bearing consequence, and the institutional conditions under which capital-weighted authority can diverge from evaluative substance.

Keywords: Bittensor, subtensor runtime, capitalized judgment, signal-capital substitution, stake-weighted authority, evaluative competence, construct validity, commensuration, classification theory, residual formalization, decentralized AI markets, institutional economics.


1. Introduction

Bittensor was introduced in Rao, Steeves, Shaabana, Attevelt, and McAteer’s (2020) whitepaper as a peer-to-peer intelligence market, an architecture in which intelligence systems would rank one another through Fisher-information logic and accumulate reward in proportion to demonstrated evaluative competence. The deployed system five years later is structurally different from that original vision. The subtensor runtime does not implement Fisher-information ranking; it implements stake-weighted aggregation over validator-submitted weight vectors. The subnet architecture does not settle a protocol-native measure of quality; it leaves task definition, querying, scoring, and weight production to subnet-specific validator practice. The runtime’s formal core concerns authority over evaluations, while evaluative substance remains residual to subnet practice.

This architectural pattern (a protocol that formalizes authority over judgment while leaving the underlying judgment to residual practice) sits between the general blockchain-governance literature and the AI-evaluation literature. Blockchain-governance scholarship analyzes decentralized decision-making through institutional economics, political economy, and legal theory. AI-evaluation scholarship develops benchmarks, evaluation methodologies, and critiques of measurement in machine-learning research. The intersection, where a protocol capitalizes evaluative judgment over contested constructs, needs resources from both traditions.

Two mechanisms describe the architecture. Capitalized judgment names the architectural posture: the protocol leaves evaluation itself to subnet practice while formalizing stake-weighted authority over those residual evaluations. The posture is visible in the subtensor runtime at specific code locations (validator-permit logic, consensus aggregation, miner-rank computation, dividend logic) and is expressed institutionally through the separation of task definition (residual, subnet-level) from reward-bearing authority (formal, runtime-level). Signal-capital substitution names the candidate failure condition: when capital-weighted authority is weakly calibrated to evaluative competence, the protocol rewards capital-positioned judgment more than well-grounded evaluation, and the reward distribution diverges from any independent quality signal.

The two mechanisms generate empirical tests. Capitalized judgment is visible from runtime architecture alone; the code shows where stake enters and where it does not. Signal-capital substitution requires external calibration evidence: comparison of capital-weighted reward allocation against independent performance benchmarks. The analysis engages the available system-wide evidence (Lui and Sun 2025), which operates within the protocol’s own evaluative vocabulary and therefore supports weaker claims than external-benchmark evidence would support. It then develops a bounded packet trace for SN1 text-prompting that narrows the empirical burden to specific subnet-level questions that future research can address.


2. Residual Formalization in Bittensor

Bittensor is a core case of residual formalization in cryptoeconomic systems. Its formalization choice is unusually legible: the subtensor runtime formalizes authority over weight submission, consensus, reward, and continuation while leaving task definition, querying procedure, scoring rubric, and construct operationalization to subnet-specific validator practice. The sense of residual is Grossman and Hart’s (1986): the runtime specifies particular rights over submitted scores (aggregation, consensus, reward distribution), while residual rights of control over everything the coded layer does not specify (construct operationalization, task design, scoring practice) remain with subnet validator practice. The core question follows from that allocation: when a protocol formalizes reward-bearing authority while leaving evaluation residual, where does evaluative authority settle and how is it converted into economic consequence?

Residual evaluation, produced in subnet-level validator practice, is captured by the runtime and transformed into authority-bearing consequence through stake-weighted aggregation. Whether that transformation preserves, distorts, or breaks the connection between evaluative substance and reward-bearing authority is the calibration question named by capitalized judgment and signal-capital substitution.

Bittensor’s compound effect operates through commensuration: the runtime converts heterogeneous subnet-level judgments about inference usefulness, data value, speed, diversity, benchmark performance, or task-specific contribution into a common reward-bearing scale. Espeland and Stevens (1998) identify commensuration as a social process in which qualitatively distinct things are made comparable through a shared metric. Bittensor performs commensuration at runtime scale. The scale, reward-bearing weight over TAO emissions, makes no claim about the underlying constructs; it provides a mechanism for ranking and rewarding them. What the runtime does not formalize, construct validity, evaluator competence, and scoring consistency, becomes the site where institutional consequence accumulates.


3. Theoretical Resources

Four analytical traditions supply the analysis. Each is assigned a specific analytical role.

3.1. Construct validity in psychometric theory

Cronbach and Meehl’s (1955) foundational framework for construct validity specifies the conditions under which an operational measure can claim to measure the underlying construct it purports to measure. The framework distinguishes three validity types: content validity (does the measure cover the relevant domain?), criterion validity (does the measure predict the outcomes it should predict?), and construct validity (does the measure connect to the theoretical construct through a network of expected relationships?). Cronbach and Meehl’s central insight is that construct validity cannot be established through any single empirical test; it requires the accumulation of evidence across multiple studies that collectively support the inference that the measure is connected to the underlying construct in the way theory predicts.

Adcock and Collier (2001) extend the construct-validity framework to qualitative and quantitative social-science research. Their central distinction is between the background concept (the abstract construct as it exists in theoretical discourse), the systematized concept (the construct as defined for a specific research purpose), the indicator (the operational measure), and the scores (the specific values the measure produces for specific cases). Each transition (background to systematized, systematized to indicator, indicator to scores) requires explicit justification, and validity failures can occur at any transition.

For Bittensor specifically, the construct-validity framework clarifies what the runtime does and does not establish. The background concept is intelligence, quality, or evaluative competence. The systematized concept varies across subnets: different subnets operationalize quality differently based on the task (inference quality, data value, speed, diversity, and so on). The indicators are subnet-specific validator scoring procedures. The scores are the weight vectors validators submit to the runtime. At each transition, validity failures can occur: the systematized concept may not correspond to the background concept, the indicator may not correspond to the systematized concept, and the scores may not reflect the indicator accurately.

The runtime does not evaluate any of these transitions. It aggregates scores (weight vectors) through stake-weighted consensus and converts them into reward-bearing authority. The runtime has no mechanism for detecting or correcting validity failures at any earlier transition.

3.2. Commensuration as a social process

Espeland and Stevens (1998) identify commensuration (the transformation of qualitatively distinct properties into a common metric) as a pervasive social process with specific cognitive, institutional, and political consequences. The process enables comparison and choice across domains that would otherwise be incommensurable, but imposes specific costs: information loss, distortion of what is being measured, and displacement of qualitative judgment. Commensuration is not ideologically neutral; it privileges what can be measured and marginalizes what cannot.

Porter (1995), in Trust in Numbers, develops the historical and sociological analysis of quantification as a response to distrust of expert judgment. Quantification offers the appearance of objectivity, impersonality, and replicability; it transforms expert judgment into impersonal procedure. The transformation has material benefits (accountability, portability across contexts, amenability to audit) and specific costs (suppression of contextual knowledge, privileging of what is easily counted, vulnerability to metric-gaming).

For Bittensor, commensuration and quantification operate at protocol scale. The runtime converts heterogeneous subnet-level judgments into a single reward-bearing metric (TAO emissions). The commensuration enables ecosystem-level resource allocation but imposes the costs Espeland-Stevens and Porter identify. The commensuration framework specifies what the runtime accomplishes institutionally: the conversion of judgments about quality into a common reward scale. The political-economic consequences of this commensuration (who benefits, who is disadvantaged, what kinds of work are rewarded) follow the patterns these authors identify for other commensurated domains.

3.3. Classification and infrastructure

Bowker and Star (1999), in Sorting Things Out, develop the analytical framework for understanding classifications as infrastructure: embedded, invisible, persistent systems that shape what can be said, done, and measured within a domain. Classifications are neither neutral nor complete; they impose visibility and invisibility through their categories, make certain kinds of work countable and certain kinds uncountable, and create institutional inertia that makes classification change slow even when underlying conditions change.

For Bittensor, the classification framework applies at two levels. At the subnet level, subnets are classified by task type (text-prompting, data-indexing, price-oracle, and so on), with each classification producing specific evaluation methodologies and reward patterns. At the runtime level, the subtensor architecture classifies actors as validators, miners, subnet owners, root holders, and so on, with each classification carrying specific rights, obligations, and reward-eligibility. The Bowker-Star framework clarifies that these classifications are not descriptions of pre-existing distinctions; they are infrastructure choices that shape what work can be done, what can be measured, and what can be rewarded.

3.4. Token-incentive design as secondary resource

Cong, Li, and Wang (2022) on token-based platform finance and Cong, Fox, Li, and Zhou (2025) on oracle economics provide secondary analytical resources for token-design analysis. The first addresses platform-token value capture and commitment; the second formulates the oracle problem, the trade-off among scalability, decentralization, and truthfulness, and the incentive designs that aim to keep on-chain reports truthful. For Bittensor, these frameworks inform the token-design analysis without supplying the primary analytical resources. Bittensor’s capitalized-judgment mechanism operates at a distinct level (protocol-mediated evaluation of contested constructs, where the reported object is an evaluation rather than an external fact) that the oracle and token-economics literatures do not address directly.


4. Prior Art and Analytical Distinction

Five clusters of prior art bear on the analysis.

4.1. AI-evaluation and benchmarking literature

The AI-evaluation literature addresses benchmarking methodology and the epistemology of model evaluation. Liang et al. (2023) on the HELM (Holistic Evaluation of Language Models) benchmark develops a systematic framework for evaluating language models across multiple metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). Burnell et al. (2023) analyze the construction and limitations of AI benchmarks, identifying systematic biases in benchmark design and application. Raji, Bender, Paullada, Denton, and Hanna (2021) examine how general benchmarks are constructed and used, identifying construct-validity failures in the benchmark ecosystem; Raji, Kumar, Horowitz, and Selbst (2022) show that policy and evaluation debates routinely presume functionality that deployed systems do not exhibit.

These works operate at the benchmark-design level and address how AI evaluation should be conducted. The present analysis operates at a different level: how a protocol capitalizes evaluative authority over contested constructs regardless of the benchmarking methodology subnets adopt. The prior art is adjacent but does not address the specific protocol-level capitalization mechanism developed here. Where a subnet uses HELM-style benchmarking as its scoring methodology, the analysis still applies because the runtime’s stake-weighted authority operates over the submitted judgments regardless of how those judgments were generated.

4.2. Blockchain-governance frameworks

Blockchain-governance analytical frameworks (van Pelt, Jansen, Baars, and Overbeek 2021; Schädler, Lustenberger, and Spychiger 2023; Alston, Law, Murtazashvili, and Weiss 2022) provide general tools for analyzing decentralized governance. Davidson, De Filippi, and Potts (2018) position blockchain as institutional technology within Williamson’s framework. The general frameworks inform the background without driving the analysis. The specific analytical object (protocol-mediated evaluation of contested constructs) is not the blockchain-governance literature’s primary focus.

4.3. Token-weighted aggregation and crypto measurement

Two formal literatures contain precedents for the mechanisms named here, and the analysis positions itself against both. Tsoukalas and Falk (2020) model token-weighted crowdsourcing directly: token-weighted voting generally degrades information aggregation relative to flat weighting, with accuracy declining in the dispersion of token holdings unless voters strategically unravel the mechanism. Signal-capital substitution is the institutional form of the condition their model isolates, reached architecturally rather than through equilibrium analysis: where the Tsoukalas-Falk results specify when capital weighting destroys aggregation within a single voting mechanism, the account here traces how a runtime embeds capital weighting at every decisive aggregation point and converts the resulting judgments into reward-bearing authority. Peer-prediction theory supplies the second precedent. Mechanisms that reward agreement among raters admit uninformative equilibria in which reports carry no signal (Miller, Resnick, and Zeckhauser 2005; Jurca and Faltings 2009); consensus-conformity capture is the stake-weighted variant of that equilibrium class, with Yuma consensus rewarding alignment with stake-weighted majority opinion. The contribution relative to both literatures is the construct-validity decomposition and the institutional reading: the formal results establish that agreement-rewarding and capital-weighted mechanisms can detach rewards from underlying quality, and the present analysis locates where, in a deployed runtime, that detachment can enter and what institutional arrangement converts it into economic consequence.

Cong, Li, Tang, and Yang (2023) document a related measurement gap at the exchange layer, showing that reported trading volumes on unregulated centralized exchanges diverge systematically from detectable authentic activity. Their gap lies between self-reported firm metrics and verifiable behavior; the Bittensor case concerns a structural gap between protocol-formalized authority and underlying evaluative substance, which no audit of reported figures would surface.

4.4. Existing Bittensor literature

Rao, Steeves, Shaabana, Attevelt, and McAteer (2020) provide the original Bittensor whitepaper. The whitepaper describes a Fisher-information-based ranking architecture that differs from the deployed subtensor runtime. It serves as a historical anchor and as evidence of the architectural shift from original specification to deployed implementation.

Lui and Sun (2025) provide the first system-wide empirical study of Bittensor, analyzing 6,664,830 events across 64 active subnets and 121,567 unique wallets through February 12, 2025. Their study reports high validator stake-reward correlation, weak miner performance-reward correlation, and extreme concentration. Their performance measure is Bittensor’s own trust metric, so their results support internal-vocabulary claims, with external-benchmark claims still requiring separate evidence. The Bittensor analysis extends the framing in two directions: the architectural-mechanism account of why the observed patterns arise, and the construct-validity framework that clarifies why internal-vocabulary evidence supports weaker claims than external-benchmark evidence would support.

4.5. Platform labor and evaluation literature

Gray and Suri (2019) analyze the institutional patterns by which platforms extract value from evaluative work performed by workers or participants who may not receive commensurate compensation or voice. The framework is adjacent to Bittensor. The case shares the general structural feature of platform-mediated evaluation of contested work without exhibiting the specific labor-relations features the platform-labor literature addresses. Bittensor validators are token-holding participants with stake in the evaluation system they operate. The analytical frameworks inform the background without supplying the primary resources.


5. Method

The analysis uses primary-source reconstruction of the subtensor runtime combined with governance-proposal analysis and empirical engagement with the Lui-Sun (2025) system-wide study and a bounded packet trace of SN1 historical OpenValidators data.

Runtime sources: the subtensor runtime at commit f65a8a3b093863803500406d4dd8a661d9f88989 and the March 19, 2026 deployment anchor 7a727dd4d219a953391e91ed2f7aa942050938f1. The bittensor-subnet-template repository at commit c355fbf44df72b1f526d723d5fc3a8179c16b9af. Runtime and governance sources were refreshed on May 4, 2026. The public opentensor/subtensor main reference resolved to 40a451f366900d00ec0b3781e4c5a4a92ba9a6b6; the cited runtime commit remains the pinned analysis surface. Specific code locations are cited where the analysis depends on runtime logic.

Governance sources: BIT-0002 (dTAO Subnet Start Call), BIT-0004 (Subnet Deregistration), and the broader BIT process documentation. These governance proposals document specific pressures within the system and indicate where governance action has been taken to address identified problems.

Empirical sources: Lui and Sun (2025) as the primary system-wide empirical anchor; the historical OpenValidators mirror for SN1 text-prompting as the bounded packet-trace source. SN1 is selected as a revealing case: the mirror preserves netuid = 1 run metadata and raw parquet rows with prompts, queried uids, completions, rewards, blocks, and run-local snapshots, enabling reconstruction of a concrete validator workflow with event-level linkage from queried validators through completions, rewards, penalties, and selected outputs.

Scope limits are substantial: the record is architecture-first, the system-wide study relies on a protocol-native proxy, and calibration claims need an adjudication layer external to the subnet’s own reward machinery. Part 13 states the full boundaries.

Method discipline: all factual claims about runtime architecture, governance proposals, and empirical findings are traceable to specific source documents. Claims about capitalized judgment are developed from runtime-architecture evidence; claims about signal-capital substitution are developed from governance-evidence, system-wide empirical findings, and architectural analysis of failure modes, and are distinguished from claims about calibration at specific subnets, which require subnet-level evidence not presently available.


6. Bittensor Architecture: Residual Judgment and Formalized Authority

6.1. What the runtime formalizes

The subtensor runtime at the cited commit implements specific functions that together constitute the protocol’s formal core. Validator eligibility is determined through is_topk_nonzero(&stake, max_allowed_validators as usize), which grants validator permits to the top-k stake holders. Active stake is computed by masking for inactivity and permit status. Weights from non-permitted validators are dropped. Consensus over submitted weights is computed through weighted_median_col_sparse(&active_stake, &weights, n, kappa). Miner ranks are computed through matmul_sparse(&clipped_weights, &active_stake, n). Validator dividends are computed through mechanisms in which stake appears at multiple decisive points (permit eligibility, active-stake weighting in aggregation, bond-weighted returns, emission normalization). At the subnet-emission layer, TaoFlow allocates emission shares through get_shares, which at the pinned commit delegates to the flow-sensitive implementation get_shares_flow, with the exponential moving average over net TAO flow computed in get_ema_flow, and without independent assessment of subnet quality.

The runtime formalizes authority over submitted judgments. It does not formalize the judgments themselves.

6.2. What the runtime leaves residual

The bittensor-subnet-template repository and historical Yuma documentation leave task definition, querying methodology, scoring rubric, and weight production in subnet-specific validator practice. Validators are expected to write or adopt subnet-specific evaluation mechanisms, query miners, score returned outputs, and set weights on chain. Historical Yuma Consensus documentation describes validators as expressing subjective preferences about miner performance and being rewarded when those evaluations align with stake-weighted validator consensus.

The residual layer includes construct operationalization (what does quality mean for this subnet?), task design (what queries elicit meaningful performance?), scoring methodology (how are responses scored?), scoring consistency (do different validators score similarly?), and calibration to external standards (if they exist for the task). None of these are formalized at the protocol level.

6.3. The architectural pattern

The architectural pattern is that Bittensor begins from residual judgment instead of a protocol-native measure of quality. Quality in Bittensor is plural: it may mean inference usefulness, data value, speed, diversity, benchmark performance, or subnet-specific value. The protocol does not settle that construct in advance. It receives weight vectors produced by validator processes that vary by subnet and by developer design. In construct-validity terms, the system begins with heterogeneous judgments about contested task value.

The runtime then formalizes authority to convert those heterogeneous judgments into commensurate reward-bearing consequences. The commensuration is accomplished through stake-weighted aggregation: validators with more stake have more authority in the weight-aggregation procedure, and the resulting consensus determines miner ranks, validator dividends, and subnet-level emission allocation.


7. Capitalized Judgment

Capitalized judgment names the arrangement more precisely than "stake-weighted evaluation." The distinction matters because stake-weighted evaluation suggests that stake weights evaluation itself; capitalized judgment names the more accurate pattern: stake weights authority over judgments produced outside the formal protocol core. Contested judgments about quality are produced in subnet-level validator practice; the runtime turns those judgments into authoritative consequence through capital-weighted aggregation that determines whose judgments count, how much they count, and how rewards are distributed.

7.1. The architectural specificity

Capitalized judgment is specific to the Bittensor architecture in ways that distinguish it from related patterns. Proof-of-stake consensus weights validator authority by stake for consensus over transaction ordering; Bittensor weights authority over evaluative judgment about contested constructs. Decentralized oracle networks weight data-provider authority by stake for reporting external facts; Bittensor uses stake-weighting for evaluation of quality. DAO voting weights governance authority by stake over protocol parameters; Bittensor ties stake-weighted judgment to reward distribution for evaluative work.

The specific pattern in Bittensor is that stake weights authority over judgments about the quality of work performed by other participants, with those judgments determining reward distribution. This is an evaluation-of-work-by-capital-holders pattern, distinct from consensus over transactions, oracle reporting, or governance voting.

7.2. The construct-validity implications

In Cronbach-Meehl (1955) terms, the capitalized-judgment architecture delegates construct operationalization to subnets (each subnet defines its own systematized concept), delegates indicator choice to validators (each validator defines its own scoring procedure), and formalizes only the aggregation of indicator scores into reward allocation. Validity failures can occur at any earlier transition (background to systematized, systematized to indicator, indicator to scores), and the runtime has no mechanism for detecting or correcting such failures.

The architectural consequence is that the protocol’s reward distribution inherits all the validity failures that occur at earlier transitions. If the systematized concept diverges from the background concept, the protocol rewards that divergence. If the indicator diverges from the systematized concept, the protocol rewards that divergence. If the scores diverge from the indicator, the protocol rewards that divergence. Capitalized judgment is the mechanism by which all these possible validity failures are converted into reward-bearing consequence.

7.3. The commensuration implications

In Espeland-Stevens (1998) and Porter (1995) terms, the capitalized-judgment architecture performs commensuration at runtime scale. Heterogeneous subnet-level judgments (about qualitatively different things, using qualitatively different indicators) are converted into a single reward-bearing metric (TAO emissions). The commensuration imposes the costs these authors identify: information loss (the qualitative differences between subnets are collapsed into a single reward scale), distortion of what is being measured (the reward scale creates incentives that may not align with the underlying evaluative constructs), and displacement of qualitative judgment (the runtime’s aggregation logic determines reward allocation without regard for the qualitative quality of the underlying judgment).

The political-economic consequences that Espeland-Stevens and Porter identify for other commensurated domains apply here conditionally. Capital-holders benefit when capital weights authority; evaluative-competence holders benefit when competence correlates with capital or when subnet scoring remains well calibrated to performance. Work that is easy to evaluate within the runtime’s aggregation logic receives clearer reward treatment, while work that is hard to evaluate becomes vulnerable to marginalization when stake and evaluative competence diverge.

7.4. The Sybil-Resistance Null

The strongest rival reading of the arrangement is comparative-institutional: stake-weighting is the standard Sybil-resistance device in permissionless systems, since evaluative authority that costs nothing to acquire is free to forge, and any open protocol must price authority somehow. Under this reading, capitalized judgment is the cost side of a familiar trade-off, and an account that documents the cost without modeling the benefit indicts every permissionless protocol equally and identifies nothing specific about Bittensor.

The null deserves a precise answer, and the answer has three parts. First, the design problem is real. The alternatives that anchor evaluative authority without capital (proof-of-personhood, accumulated reputation, permissioned evaluator panels) each reintroduce an identity registry, a bootstrapping authority, or a gatekeeper, and Bittensor’s choice occupies a coherent point in that design space. Second, Sybil resistance constrains the form of the answer without requiring this answer. Sybil resistance requires that authority be costly to acquire; it leaves open whether the same capital position should simultaneously weight consensus over judgments, multiply dividends, and gate permits. Designs that price entry while flattening influence (stake-gated admission with equal-weight aggregation, or stake floors with capped weight) purchase the same forgery resistance with a weaker coupling between capital and evaluative authority, and Tsoukalas and Falk (2020) formalize the aggregation cost of the coupling Bittensor retains. Third, the distinction in Section 7.1 carries the comparative burden: proof-of-stake consensus prices authority over transaction ordering, where any honest ordering is acceptable and the construct is settled, while Bittensor prices authority over evaluation of contested constructs, where which judgment prevails is the entire question. The Sybil function explains why authority must be costly. It leaves unexplained why the cost-bearer’s judgment about quality, an object the protocol never defines, should bind reward distribution in proportion to the size of the position. The arrangement is the protocol’s answer to a real problem; capitalized judgment names what the answer costs, and signal-capital substitution names the condition under which the cost comes due.


8. Signal-Capital Substitution

Signal-capital substitution is the candidate failure mode within capitalized judgment. It appears when capital-weighted authority is weakly calibrated to evaluative competence, so the protocol rewards capital-positioned judgment rather than well-grounded judgment. The failure mode has specific structural conditions under which it arises.

8.1. Structural conditions

Four structural conditions increase the probability of signal-capital substitution. First, validators may be bootstrapped through early trust and delegation before meaningful subnet evaluation exists; BIT-0002 documents this pattern explicitly for the dTAO subnet start call, describing "bogus validator or mining code" establishing early validator trust and child-hotkey delegations before subnet owners have published production code. Second, capital may track attention or speculative interest more closely than evaluative skill; token markets generally exhibit this property, and BIT-0002 describes participants rushing into subnet tokens before they can meaningfully evaluate whether to mine, validate, or buy. Third, validator agreement may reflect copying or prestige rather than competence; stake-weighted consensus rewards alignment with majority stake-weighted opinion, creating incentives for conformity that do not require evaluative competence. Fourth, some subnet tasks may be weakly grounded enough that the protocol lacks an external benchmark capable of disciplining internal rankings; for such tasks, no external test can detect calibration failure.

8.2. Runtime mechanisms that support the failure mode

The subtensor runtime supports this failure mode structurally because stake enters every decisive point in the aggregation pipeline. Validator permit eligibility depends on stake. Active stake weights the consensus computation. Miner rank depends on active stake through the matrix multiplication. Validator dividends are multiplied by active stake. Emission allocation at the subnet level operates through flow-sensitive logic tied to capital flows.

The runtime does contain an anti-copying repertoire, and the claim must be stated against it. Commit-reveal weight submission (default-enabled network-wide since August 2025, with a per-subnet owner override; default states are read from the pinned runtime and its migrations, with live per-subnet toggles outside the traced record) makes copied weights stale by delaying their publication. Bond mechanics pay validators whose weight allocations precede consensus movement, with the bonds penalty clipping weight placed above consensus that consensus never ratifies; the optional liquid-alpha variant (per-subnet opt-in, disabled by default at the pinned commit) steepens that prediction premium. Each of these mechanisms disciplines a specific behavior, and the behavior is timing relative to the stake-weighted consensus signal. Commit-reveal alters when weights become public, while the standard against which revealed weights are scored remains consensus among validators; bonds reward judging before the consensus signal is observable, conditional on convergence with the future consensus signal. The pipeline’s complete input surface at the pinned commit consists of stake quantities, validator-submitted weight vectors, registration and activity metadata, and hyperparameters. No storage item, extrinsic, or epoch input carries external ground truth about miner output quality.

The precise claim is therefore architectural. The runtime’s mechanisms reward temporally independent judgment (judging before the consensus signal is observable) and never evaluatively grounded judgment (judging correctly against an external standard), because no external standard enters the pipeline. Where a subnet’s validator population independently tracks an external standard, consensus anticipation and evaluative competence coincide, and the runtime cannot distinguish that case from coordinated drift; the indistinguishability is the structural condition this section names. If stake-authorized judgment is badly calibrated, the runtime has no internal mechanism that re-anchors authority to demonstrated competence against any external standard before rewards are distributed.

8.3. The failure-mode taxonomy

Signal-capital substitution can operate through four pathways that the structural conditions above enable. Bootstrapping capture occurs when capital-based validator authority is established before substantive evaluative work begins, and the capital-based authority persists even as evaluative work emerges or fails to emerge. Speculative-attention capture occurs when capital flows track speculative interest rather than evaluative substance, so capital-weighted authority becomes authority over speculative narrative rather than over evaluation. Consensus-conformity capture occurs when stake-weighted consensus rewards alignment with majority stake-weighted opinion, inducing validators to copy high-stake validators rather than evaluate independently. Weak-construct capture occurs when the subnet task lacks external calibration, so internal stake-weighted consensus can produce any ranking without external constraint.

Each pathway predicts specific empirical patterns, and the predictions must be ones the protocol specification does not already entail. Bootstrapping capture predicts validator authority positions established before substantive evaluative work begins and persisting after that work emerges or fails to emerge; the observable is the temporal ordering of stake establishment and evaluative-substance markers, since stake-reward correlation itself is entailed by the dividend logic under every hypothesis. Speculative-attention capture predicts high correlation between subnet-token price trajectory and validator-reward distribution. Consensus-conformity capture predicts high correlation among validator-submitted weights within stake strata. Weak-construct capture predicts high divergence between internal protocol rankings and external performance benchmarks (when such benchmarks exist). The conformity and substitution pathways have formal precedents (Miller, Resnick, and Zeckhauser 2005; Tsoukalas and Falk 2020); the taxonomy locates where their equilibrium conditions can arise inside a deployed runtime.

8.4. Lui-Sun evidence engagement

Lui and Sun (2025) report high validator stake-reward correlation, weak miner performance-reward correlation, and extreme concentration. Two discipline rules govern what this evidence can support. First, their performance measure is Bittensor’s own trust metric, constructed from the stake-weighted consensus process, so the results are strongest for claims about internal-vocabulary reward-capital alignment and weaker for claims about external-benchmark calibration. Second, and more restrictively, high validator stake-reward correlation is entailed by the emission logic itself: validator dividends are stake-proportional by construction (Section 6.1), so the correlation is predicted by every hypothesis about calibration, including perfect calibration, and carries no discriminating power among the pathways. The Lui-Sun findings that retain evidential value against the pathway taxonomy are the ones the specification does not entail: the weak miner performance-reward correlation, which could have come out otherwise and is consistent with weak-construct capture or with divergence between validator scoring and stake-weighted aggregation, and the persistence of concentration patterns across time, which bears on whether early capital positions decay.

The evidence is insufficient to distinguish among the four pathways or to establish signal-capital substitution with external-benchmark calibration. Stronger evidence requires subnet-level studies that compare internal stake-weighted rankings against external performance benchmarks. Section 12 specifies falsification conditions that future research can address.


9. Governance-Pressure Evidence

9.1. BIT-0002: launch-phase capital-capture pressure

BIT-0002 (dTAO Subnet Start Call) records a launch environment in which capital and validator authority move ahead of evaluative substance. The proposal states directly that "bogus validator or mining code" attempts to establish early validator trust and child-hotkey delegations before subnet owners have time to publish and organize production code. Participants are described rushing into subnet tokens before they can meaningfully evaluate whether to mine, validate, or buy. The institutional pattern BIT-0002 documents is the bootstrapping-capture pathway: capital-based validator authority established early, potentially persisting even as substantive evaluative work emerges or fails to emerge.

The governance response (the start-call mechanism) addresses the symptom (rushed early capital allocation) while leaving the underlying structural condition intact: the capitalized-judgment architecture allows capital to establish authority before evaluative substance is available. The start-call adds a procedural delay before subnet token trading becomes active, which can reduce the severity of bootstrapping capture but does not eliminate the structural pathway.

9.2. BIT-0004: continuation-phase weak-calibration pressure

BIT-0004 (Subnet Deregistration) records a second pressure. The proposal states that non-functional subnets can continue consuming emissions, chain footprint, and compute resources. It proposes price-based pruning when the subnet limit is reached. The dissolution path is root-only (requiring root-level governance action rather than subnet-level mechanisms).

The institutional pattern BIT-0004 documents is the weak-construct pathway: subnets can persist without demonstrating functional evaluation, because the protocol lacks an external measure of functionality that would trigger deregistration automatically. The governance response (price-based pruning) ties deregistration to market signals (subnet token price), with no direct measure of evaluative performance. This choice is consistent with the broader capitalized-judgment architecture: where direct measurement of evaluative quality is unavailable, capital-based signals substitute.

9.3. Structural implication of the governance record

The governance record supports a specific structural claim. The Bittensor protocol recognizes that capital-capture and weak-calibration pressures exist (the BIT proposals document them) and attempts to address them through governance mechanisms (start-calls, deregistration procedures) that operate downstream of the capitalized-judgment architecture without modifying the architecture itself. The governance mechanisms supplement the capitalized-judgment core while leaving the structural conditions for signal-capital substitution in place. The runtime-level modification record is weighed in Section 12.4 and carries the same pattern.


10. System-Wide Empirical Pattern

Lui and Sun (2025) provide the first system-wide empirical study of Bittensor available at the May 2026 review. Their methodology and findings warrant detailed engagement because they establish the empirical baseline against which subsequent work must position itself.

10.1. Methodology

Lui and Sun analyze 6,664,830 events across 64 active subnets and 121,567 unique wallets through February 12, 2025. They construct metrics for validator-stake-reward correlation, miner-performance-reward correlation, and concentration across multiple dimensions. Their performance measure for miners is Bittensor’s own trust metric, which is constructed through the stake-weighted consensus process documented in the subtensor runtime. Their concentration measures include stake Gini coefficients, reward Gini coefficients, and cross-wallet participation patterns.

10.2. Key findings

Their reported findings include: high validator stake-reward correlation (stake is strongly predictive of reward for validators); weak miner performance-reward correlation (the trust metric is weakly predictive of reward for miners); extreme concentration across stake, reward, and participation dimensions; and persistent concentration patterns across time. The concentration findings are carried here qualitatively, as Lui and Sun summarize them; the underlying coefficients belong to their paper, and the descriptive stake-distribution work Part 13 marks as this analysis’s next empirical step would put independent figures on the validator-side concentration directly.

10.3. Analytical engagement

The Lui-Sun results support specific claims within the internal-vocabulary scope, subject to the entailment discipline stated in Section 8.4: validator stake-reward correlation is a consequence of the dividend specification and is set aside as non-discriminating. Weak miner performance-reward correlation within the internal trust metric suggests that the stake-weighted aggregation produces results that diverge from the internal performance signal, a finding consistent with weak-construct capture or with a residual divergence between validator-submitted scoring and stake-weighted aggregation. Extreme concentration is consistent with the Matthew-effect dynamics the capitalized-judgment architecture structurally supports (capital-based authority compounds across time).

The Lui-Sun evidence supports weaker claims than external-benchmark evidence would support. Internal trust is constructed from the same stake-weighted consensus that determines reward; correlations between internal-trust and reward therefore cannot fully distinguish calibrated protocol behavior from protocol behavior that satisfies its own internal vocabulary without external calibration. Lui-Sun themselves note this limitation. The evidence supports a narrower claim: Bittensor exhibits specific patterns within its own evaluative vocabulary and leaves external-benchmark calibration as an open empirical question.


11. SN1 Packet Reconstruction

SN1 text-prompting supplies one usable subnet-level entry point for detailed packet reconstruction. The historical OpenValidators mirror preserves netuid = 1 run metadata and raw parquet rows with prompts, queried uids, completions, rewards, blocks, and run-local snapshots. The mirror enables reconstruction of a concrete validator workflow with event-level linkage; three deterministic movement summaries (dense, prompt-family, and prompt-template fit-to-score movement) replay exactly from the preserved parquet. The reconstruction still does not supply external weak-calibration proof. It narrows where that trace must work. The numerically detailed claims require a formal reproduction package before publication: preservation identifiers, script hashes, row ranges, reconstruction checks, dependency lock, and an adjudication layer external to the subnet’s own reward machinery.

Five measurement terms recur in this Part and are defined by their use here. The weighted base is the composite reward surface produced by the run’s three-component combination (0.3 DPO, 0.3 reciprocate, 0.4 RLHF) before filter gating. Support at a stage is the proportion of scored completions retaining a nonzero reward value after that stage’s gating, so weighted-base support measures survival into the composite while relevance and task-validation support measure survival through those filters. Fit, equivalently prompt overlap, is the analysis-side lexical-overlap measure between a completion and its prompt, computed over the recovered rows. Movement is the change in the validator’s saved moving_averaged_scores state attributable to a row under the validator’s own EMA rule, reported as absolute movement. A wrap-aligned snapshot is a saved score state taken at the epoch boundary at which the traced loop submits weights and immediately logs save_state(), which is the alignment used for score-to-weight comparison.

11.1. Closure against preserved state

The first ten answer-task rows in the SN1 slice close against preserved Weights and historical NeuronInfoLite state across all 500 queried miner observations. The paired save-state rows preserve the validator’s moving_averaged_scores vector over the metagraph rather than only the reward outputs. Within that bounded window, rewarded queried miners sit far above zero-reward queried miners on the saved score surface in every row that pays positive rewards. The early snapshot transitions reconstruct from the prior snapshot plus the intervening task reward rows under the validator’s own EMA rule.

11.2. Task-family divergence

Across the recovered OpenValidators runs, the followup family moves the saved score state at least as much as, and usually more than, the answer family. The form of reward anticipation changes with reward density: the sparse regime is thin-support and ranking-sensitive; the dense regime is broad-support and inclusion-sensitive. Both traced netuid = 1 runs use the same three-component reward base, with 0.3 DPO, 0.3 reciprocate, and 0.4 RLHF. The traced workflow cannot be described through the template-repository default alone. The normalized component reward surfaces already split in the same sparse-versus-dense direction across augment, followup, and answer families alike.

11.3. Late-step narrowing mechanism

Once the weighted base is reconstructed from the OpenValidators rows and fitted reward surfaces, the later-step followup ladder becomes clearer. In the dense run, weighted-base support stays almost flat from 98.90% at followup0 to 98.75% at followup3, while relevance support falls from 94.35% to 82.87%, and task validation carries it the rest of the way to 79.17%, nearly matching final support at 79.11%. The sparse run shows the same sequence on a thinner scale. In both runs, the main narrowing burden enters at relevance and then at task validation, with diversity and NSFW contributing little.

11.4. Relevance-filter mechanism

The September 2023 branch that matches the recovered prompt flow places followup task validation mainly in a wrapper-content filter that zeroes outputs leaking answer or summary scaffolding into a question. In the dense run, that historical rule matches the logged task-validator surface at about 99%. The widening later-step burden sits mainly in relevance-only failures, which rise from 4.52% at followup0 to 15.45% at followup3, while task-only failures stay in the 3-4% range. Those relevance-only failures grow more anchored to prompt language as the ladder advances: prompt-overlap rises from 0.2178 to 0.5209, accepted rows rise from 0.4527 to 0.6038, and application or impact drift rises from 22.54% to 40.83% of relevance-only failures.

11.5. Score-mechanism properties

At the accepted-completion level, prompt fit becomes modestly informative about score movement. The correlation between overlap and absolute EMA movement rises from 0.149 at followup1 to 0.210 at followup3, while the correlation between overlap and raw reward stays near zero. That pattern survives the split between the raw_context and qa_context prompt families. By followup3, accepted completions in the highest-fit quartile move saved score about 40% more than those in the lowest-fit quartile, even though mean reward is slightly lower and mean prior saved score is nearly identical. The OpenValidators packet therefore looks more like revision around a prior score state than like simple reward maximization.

11.6. Score-to-weight alignment

A direct score-to-weight alignment pass hardens the final step. In the sparse run, a wrap-aligned snapshot at block 1164303 reconstructs the stored chain row almost exactly, with 142 of 145 emitted values matching integer-for-integer. In the dense run, same-block wrap-aligned candidates retain near-complete support closure and strong order continuity between locally reconstructed emitted rows and historical chain rows despite missing literal equality, with best shared-support Pearson at 0.8918. Across 23 dense wrap candidates, the nearest prior snapshot beats the same-block snapshot on shared-support Pearson, shared-support Spearman, and mean absolute shared difference in every case, and improves top-50 overlap in 21 of 23, with prior lag centered at 29 blocks. The traced validator loop is consistent with that shape: once the epoch boundary is crossed, the code submits weights without waiting for finalization and then immediately logs save_state() at the current block. On the lag evidence, the dense mismatch is best read as a timing artifact; that reading is the best-supported of the stated alternatives rather than a settled identification.

11.7. Residual error structure

The remainder is not concentrated in the decisive upper ranks. The upper 250 ranks carry about 29.5% of chain mass but only 26.3% of dense error; the long tail carries about 70.5% of chain mass and 73.7% of the error. Replacing the simplified local reconstruction with the stricter client-plus-chain route changes the dense row by about one integer unit on average (one to four across the whole emitted row), leaves top-50 overlap and exact-ratio unchanged, and moves Pearson only at the 1e-08 scale. Interval replay narrows the same point further: 21 of 23 dense intervals close back onto the saved snapshot to numerical precision; the two outliers reduce to hotkey-reassignment resets.

11.8. Implications for the architectural analysis

The SN1 packet trace supports specific architectural claims, and its central finding must be stated with its actual valence. Within the traced window, the validator’s local scoring process is reconstructable: the components (DPO, reciprocate, RLHF), the stages (relevance filtering, task validation, diversity, NSFW), and the late-step narrowing mechanism (relevance filter driving support loss) are visible in the trace, and local judgment reaches chain-facing weight with high fidelity (142 of 145 emitted values matching integer-for-integer in the sparse run; 21 of 23 dense intervals closing to numerical precision on replay). For the signal-capital-substitution pathway this is a constraining result. At the one subnet traced, the judgment-to-weight pipeline transmits local evaluation faithfully, so whatever detachment of reward from evaluative substance occurs in the system does not occur at this layer, for this validator, in this window. The trace accordingly narrows where substitution could operate: at the stake-aggregation layer (whose judgments count and how much), at the permit layer (who evaluates at all), and inside the local scoring components themselves (whether RLHF, DPO, and reciprocate scores track external task quality), since the transmission layer is exonerated on this record.

What the SN1 reconstruction does not establish is external calibration. The local scoring mechanism uses RLHF, DPO, and reciprocate components that are themselves protocol-internal or closely protocol-adjacent measures. Whether those components correlate with external performance benchmarks for the task (text-prompting quality for end-user applications, for instance) remains an open question. The reconstruction bounds the empirical burden: future research can investigate calibration at the specific subnet with specific benchmarks, instead of treating calibration as a whole-network question.


12. Falsification

Three operationalized conditions would weaken the claims.

12.1. Calibrated capital-weighted authority

The claim that the capitalized-judgment architecture leaves calibration unsecured weakens if subnet-level studies with external performance benchmarks show that higher-authority validators are consistently better calibrated to task performance than lower-authority validators across a broad range of subnet types; such a finding would show stake operating as a reliable proxy for evaluative competence even without an internal re-anchoring mechanism.

Observable metric: correlation between validator stake and validator accuracy against external performance benchmarks across subnet types.

Measurement protocol: identify at least five subnets with tasks that admit external benchmarks. For each subnet, construct an external benchmark and measure each validator’s accuracy against it. Compare validator accuracy to validator stake.

Falsification threshold: a positive correlation of r > 0.5 between validator stake and external-benchmark accuracy across at least three of the five studied subnets. The criterion is effect size, stated as a convention pending a calibrated baseline, with statistical significance reported separately.

12.2. Reward-performance alignment

The signal-capital-substitution claim weakens if reward allocation tracks external performance strongly across the majority of studied subnets while stake concentration has only a limited marginal effect on who receives emissions.

Observable metric: correlation between miner reward and external-benchmark performance, controlling for validator-stake-weighting effects.

Measurement protocol: identify at least five subnets with tasks that admit external benchmarks. For each subnet, correlate miner reward with external-benchmark performance. Decompose the correlation into stake-weighted and non-stake-weighted components.

Falsification threshold: a positive correlation of r > 0.5 between miner reward and external-benchmark performance across the majority of studied subnets (an effect-size criterion, with significance reported separately), with stake-weighting effects explaining less than 30% of the miner-reward variance; the variance bound is likewise a stated convention.

12.3. Absence of launch-phase capital capture

The bootstrapping-capture claim weakens if launch and continuation evidence shows that evaluative substance reliably precedes capital inflow and validator trust bootstrapping, making the BIT-0002 and BIT-0004 pathologies exceptional cases instead of structurally recurrent pressures.

Observable metric: temporal ordering of evaluative-substance markers (functional code, task specification, benchmark publication) and capital-inflow markers (validator stake, child-hotkey delegation, token trading volume) across subnet launch cohorts.

Measurement protocol: for each subnet launched in a defined observation window, identify the date of first functional code deployment, first task specification publication, first benchmark publication, first validator stake establishment, first child-hotkey delegation, and first significant token trading volume. Measure the proportion of subnets where evaluative-substance markers precede capital-inflow markers.

Falsification threshold: evaluative-substance markers precede capital-inflow markers in more than 70% of observed subnet launches across at least 20 launches.

12.4. Alternative explanations

Two alternative explanations for the observed patterns deserve explicit engagement.

Token-market-efficiency as primary driver. Under this explanation, the patterns Lui-Sun observe and the governance pressures BIT-0002 and BIT-0004 document reflect efficient token-market sorting. Capital flows toward subnets with higher expected returns, which correlate with evaluative quality because high-quality subnets produce sustained demand. That account predicts capital flows following quality over time. The observed patterns (bootstrapping capture, weak miner performance-reward correlation, extreme concentration) would be transient under efficient-market conditions. Their persistence across the Lui-Sun study period and the recurring governance pressures documented in the BIT process suggest that capital-flow-to-quality mapping is weak at best. The efficient-market explanation therefore requires additional mechanism specification to account for persistent divergence between capital and quality.

Protocol-design evolution as primary driver. Under this explanation, the patterns reflect transitional features of an evolving protocol that will converge to better capital-quality alignment as the protocol matures. The record of modification is real and must be weighed: commit-reveal weight submission was introduced against weight-copying and made default network-wide in August 2025, and bond mechanics with the optional liquid-alpha variant reward early position-taking against later consensus. These modifications discipline the timing of validator judgment relative to consensus. The structural features identified here (stake-weighted permits, stake-weighted consensus, stake-weighted dividends, capital-flow-tied emissions) survive each modification, and the modifications themselves anchor to the consensus signal rather than to any external standard (Section 8.2). Protocol evolution to date has therefore hardened the anti-copying perimeter around the capitalized-judgment core while leaving the core unaltered; an evolution that did alter it, an external-benchmark anchor entering the reward path, is the observable this alternative predicts and the record does not yet contain.


13. Scope of Inference

The available record is architecture-first, with no subnet-by-subnet quality audit. Lui and Sun’s study ends on February 12, 2025 and relies on a protocol-native performance proxy, which limits what can be said about external quality. The subtensor runtime and traced public surfaces show how authority is weighted and how rewards are distributed; they do not by themselves establish evaluator competence, collusion, or construct validity for every subnet task.

The historical OpenValidators mirror supplies one usable SN1 trace surface. The first ten answer-task rows in that surface close against preserved identity, weight, and saved-score state. That surface is historically bounded and narrower than the full local validator event schema.

Replacing the simplified local reconstruction with the stricter client-plus-chain route changes the dense row by about one integer unit on average (one to four across the whole emitted row), leaves top-50 overlap and exact-ratio unchanged, and moves Pearson only at the 1e-08 scale (section 11.7). The open dense remainder is better treated as a lag-and-residual-transform problem.

Narrow chain-side structure does not identify a dead or empty subnet on its own. A traced calibration contrast is missing. Stronger calibration claims within SN1 will require an adjudication layer external to the subnet’s own reward machinery. The finding here is bounded to pressures inside Bittensor’s stake-weighted evaluation architecture.

The analysis also leaves the actor level undescribed. Stake-distribution descriptives at the pinned block, the identity classes of the top validator positions, and the root-network share are recoverable from the archive endpoint the analysis already pins, and they constitute the next descriptive step for the political-economy reading; capital appears in this paper as an architectural variable rather than as named holders.

The analysis does not measure subnet-level construct validity for each of the 64 active subnets in the Lui-Sun study because subnet-level benchmarking remains unavailable at that scale. The architectural claim (capitalized judgment is the runtime’s formalization pattern) is supported by the runtime evidence. The failure-mode claim (signal-capital substitution is a structural possibility under specific conditions) is supported by architectural analysis, governance evidence, and system-wide empirical patterns. The calibration claim at specific subnets remains an empirical question that subnet-level research can address.


14. Conclusion

Bittensor’s architectural choice is to formalize authority over evaluation while leaving evaluation itself in residual subnet practice. The subtensor runtime at specific code locations implements stake-weighted aggregation over validator-submitted weight vectors; the underlying evaluation is produced residually in subnet-specific validator practice. The two mechanisms developed here, capitalized judgment as architectural posture and signal-capital substitution as candidate failure mode, name the institutional pattern this architectural choice produces.

Construct-validity theory (Cronbach and Meehl 1955; Adcock and Collier 2001), commensuration research (Espeland and Stevens 1998; Porter 1995), and classification theory (Bowker and Star 1999) specify the analytical implications of the capitalized-judgment architecture. The protocol performs commensuration at runtime scale, converting heterogeneous subnet-level judgments into a single reward-bearing metric through mechanisms that weight authority by capital. The commensuration inherits all validity failures that occur at earlier transitions (background concept, systematized concept, indicator, scores), because the runtime has no mechanism for detecting or correcting failures at those transitions.

The governance record (BIT-0002, BIT-0004) documents specific pressures within the architecture. The system-wide empirical record (Lui and Sun 2025) documents specific patterns within the protocol’s internal vocabulary. The SN1 packet trace documents a concrete local scoring mechanism and its connection to chain-facing weights. Together, these sources support architectural claims about the capitalized-judgment mechanism and failure-mode claims about signal-capital substitution as a structural possibility. Calibration claims for specific subnets require external-benchmark evidence that future research can develop.

In Bittensor, residuals do not remain inert. The residual evaluation layer, task definition, scoring methodology, and construct operationalization, is where institutional consequence accumulates. What the runtime does not formalize, subnets must supply through their own practice, and that supply determines whether the protocol’s capitalized-judgment architecture produces calibrated or miscalibrated reward allocation. The protocol’s formal openness to subnet variation coexists with structural pressure toward capital-aligned authority, and the balance between these forces is the determinative question for effective evaluation in the Bittensor ecosystem.


References

Adcock, Robert, and David Collier. 2001. "Measurement Validity: A Shared Standard for Qualitative and Quantitative Research." American Political Science Review 95(3): 529-546.

Alston, Eric, Wilson Law, Ilia Murtazashvili, and Martin Weiss. 2022. "Blockchain Networks as Constitutional and Competitive Polycentric Orders." Journal of Institutional Economics 18(5): 707-723.

Bowker, Geoffrey C., and Susan Leigh Star. 1999. Sorting Things Out: Classification and Its Consequences. MIT Press.

Burnell, Ryan, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martinez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, and Jose Hernandez-Orallo. 2023. "Rethink Reporting of Evaluation Results in AI." Science 380(6641): 136-138.

Cong, Lin William, Liam Fox, Siguang Li, and Luofeng Zhou. 2025. "A Primer on Oracle Economics." Journal of Corporate Finance 94: 102800.

Cong, Lin William, Xi Li, Ke Tang, and Yang Yang. 2023. "Crypto Wash Trading." Management Science 69(11): 6427-6454.

Cong, Lin William, Ye Li, and Neng Wang. 2022. "Token-Based Platform Finance." Journal of Financial Economics 144(3): 972-991.

Cronbach, Lee J., and Paul E. Meehl. 1955. "Construct Validity in Psychological Tests." Psychological Bulletin 52(4): 281-302.

Davidson, Sinclair, Primavera De Filippi, and Jason Potts. 2018. "Blockchains and the Economic Institutions of Capitalism." Journal of Institutional Economics 14(4): 639-658.

Espeland, Wendy Nelson, and Mitchell L. Stevens. 1998. "Commensuration as a Social Process." Annual Review of Sociology 24: 313-343.

Gray, Mary L., and Siddharth Suri. 2019. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Houghton Mifflin Harcourt.

Grossman, Sanford J., and Oliver D. Hart. 1986. "The Costs and Benefits of Ownership: A Theory of Vertical and Lateral Integration." Journal of Political Economy 94(4): 691-719.

Jurca, Radu, and Boi Faltings. 2009. "Mechanisms for Making Crowds Truthful." Journal of Artificial Intelligence Research 34: 209-253.

Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. "Holistic Evaluation of Language Models." Transactions on Machine Learning Research 2023: 1-108.

Lui, Elizabeth, and Jiahao Sun. 2025. "Bittensor Protocol: The Bitcoin in Decentralized Artificial Intelligence? A Critical and Empirical Analysis." arXiv:2507.02951.

Miller, Nolan, Paul Resnick, and Richard Zeckhauser. 2005. "Eliciting Informative Feedback: The Peer-Prediction Method." Management Science 51(9): 1359-1373.

opentensor. 2026. "bits." GitHub repository. Accessed May 4, 2026. https://github.com/opentensor/bits

opentensor. 2026. "subtensor." GitHub repository. Accessed May 4, 2026. https://github.com/opentensor/subtensor

opentensor. n.d. "Yuma Consensus." Accessed May 4, 2026. https://docs.learnbittensor.org/learn/yuma-consensus

Porter, Theodore M. 1995. Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. Princeton University Press.

Raji, Inioluwa Deborah, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. 2021. "AI and the Everything in the Whole Wide World Benchmark." Proceedings of the NeurIPS Benchmarks and Datasets Track.

Raji, Inioluwa Deborah, I. Elizabeth Kumar, Aaron Horowitz, and Andrew D. Selbst. 2022. "The Fallacy of AI Functionality." Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2022).

Rao, Yuma, Jacob Steeves, Ala Shaabana, Daniel Attevelt, and Matthew McAteer. 2020. Bittensor: A Peer-to-Peer Intelligence Market. https://bittensor.com/whitepaper

Schädler, Lukas, Michael Lustenberger, and Florian Spychiger. 2023. "Analyzing Decision-Making in Blockchain Governance." Frontiers in Blockchain 6: 1256651.

Tsoukalas, Gerry, and Brett Hemenway Falk. 2020. "Token-Weighted Crowdsourcing." Management Science 66(9): 3843-3859.

van Pelt, Rowan, Slinger Jansen, Djuri Baars, and Sietse Overbeek. 2021. "Defining Blockchain Governance: A Framework for Analysis and Comparison." Information Systems Management 38(1): 21-41.