TEXT AND DATA MINING, AI TRAINING, AND COPYRIGHT: IS INDIA HEADING TOWARDS A KNOWLEDGE ENCLOSURE?

Authored By: RIYA SINGH

HIMACHAL PRADESH NATIONAL LAW UNIVERSITY, SHIMLA

ABSTRACT

The technical method that makes it possible to train modern artificial intelligence models is called text and data mining, or TDM. Different countries have implemented different TDM regimes. The UK has enacted a statutory exception permitting computational analysis by lawful users for non-commercial research, while the European Union has adopted a two-tiered approach, granting research institutions a mandatory exception and allowing commercial TDM subject to an opt-out for rightholders. The United States has developed a fair use doctrine that accommodates mass digitisation and search. India, on the other hand, has not yet adopted a TDM exception and remains reliant on the ambiguous Section 52 “fair dealing” provisions to cover large-scale automated copying and storage. In order to better align copyright with public interests in the AI era, this paper proposes doctrinal, legislative, and policy changes. It argues that India’s current position threatens to create a “knowledge enclosure” in which access to digital training materials is effectively gated by licensing or private enforcement.

WHY TDM MATTERS: THE TECHNICAL AND ECONOMIC STAKES

Text and Data Mining (TDM) is more than an esoteric academic notion: it is the technical method used to train AI models on massive datasets of text, images, audio, and code. TDM typically entails three steps: (a) making full-text copies of the source materials; (b) extracting structured features; and (c) storing embeddings and indices for future use. These actions amount to copying and storage, frequently on a massive scale, which engages copyright’s fundamental reproduction and adaptation rights.[1]
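
The three steps described above can be sketched schematically. The following is a toy illustration only (a hypothetical one-document corpus and a trivial word-frequency table standing in for real feature extraction), not any production training pipeline, but it shows why each stage implicates copyright: the process begins with a verbatim full-text copy, and ends with a second, ongoing act of storage.

```python
# Toy sketch of the three TDM stages (illustrative only; all names hypothetical).

def copy_full_text(work: str) -> str:
    """Stage (a): a verbatim, full-text copy -- the act that engages
    the reproduction right."""
    return str(work)

def extract_features(text: str) -> dict[str, int]:
    """Stage (b): reduce the copy to structured features
    (here, a trivial word-frequency table)."""
    counts: dict[str, int] = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def store_index(features: dict[str, int], index: dict[str, dict]) -> None:
    """Stage (c): persist the features for later reuse -- a second,
    ongoing act of storage."""
    index["doc-1"] = features

index: dict[str, dict] = {}
copy = copy_full_text("The quick brown fox jumps over the lazy dog")
store_index(extract_features(copy), index)
print(index["doc-1"]["the"])  # "the" appears twice in the copied text
```

Even in this miniature form, a full copy of the work exists in memory at stage (a), and a derived artefact persists indefinitely at stage (c), which is why the scale question dominates the legal analysis.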

Economically, the availability of large, high-quality datasets gives businesses that possess carefully curated training datasets and licensing infrastructure disproportionate market leverage. This creates a “data-enclosure” dynamic: whereas a knowledge commons promotes innovation, proprietary datasets and licences may privatise training inputs, raising barriers to entry for public interest actors, civil society researchers, and new entrants.[2]

THE EU MODEL: A TWO-TIERED, RIGHTS-AWARE EXCEPTION (DIRECTIVE 2019/790)

India can learn much from the two-tiered TDM system enacted in the EU’s DSM Directive of 2019. Article 3 mandates an exception for reproductions and extractions made for the purposes of scientific research by research organisations and cultural heritage institutions; Article 4 permits lawful users to conduct TDM for any purpose, subject to rightholders’ ability to opt out by expressly reserving their rights (e.g., through machine-readable flags).[3]

This system aims to strike a balance between competing interests. Article 3 guarantees research freedom in the public interest: libraries, universities, and museums are free to mine works for scientific research without negotiating millions of individual permissions. Article 4 recognises the commercial interests of rightholders, since they remain free to “reserve” TDM if they so choose and Member States may attach retention and security conditions.[4] To prevent private contracts from superseding the statutory exceptions, the DSM Directive further rendered contractual overrides unenforceable against certain exceptions.[5]
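
In practice, the “machine-readable” reservation contemplated by Article 4 can take several forms; one emerging convention is the W3C Community Group’s TDM Reservation Protocol (TDMRep), under which a site can publish a small JSON declaration stating whether TDM rights are reserved for given paths. The sketch below parses such a declaration and decides whether mining may proceed. The declaration content and the naive prefix matching are hypothetical illustrations, not an implementation of any official matcher.

```python
import json

# Hypothetical TDMRep-style declaration a site might serve
# (e.g., at /.well-known/tdmrep.json); "tdm-reservation": 1
# means TDM rights are reserved for the matching location.
declaration = json.loads("""
[
  {"location": "/articles/*",
   "tdm-reservation": 1,
   "tdm-policy": "https://example.com/tdm-licence.json"},
  {"location": "/open-data/*",
   "tdm-reservation": 0}
]
""")

def tdm_allowed(path: str, rules: list[dict]) -> bool:
    """Return True if no matching rule reserves TDM rights for this path.
    Matching here is a naive prefix check on the wildcard location."""
    for rule in rules:
        prefix = rule["location"].rstrip("*")
        if path.startswith(prefix):
            return rule.get("tdm-reservation", 0) == 0
    return True  # no applicable reservation: the Article 4 default applies

print(tdm_allowed("/articles/2024/ai.html", declaration))  # False (reserved)
print(tdm_allowed("/open-data/corpus.txt", declaration))   # True (permitted)
```

The design point this illustrates is the one the article returns to later: the whole scheme works only if reservations are discoverable and standardised, since a miner that never fetches or cannot parse the declaration cannot honour it.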

Strengths of the EU design include (a) legal certainty for legitimate research, (b) a fair balance with rightholders’ interests, and (c) flexibility, since Member States may tailor the opt-out mechanism and security obligations. The architecture has flaws as well: the opt-out mechanism allows rightholders to restrict commercial TDM, which may benefit incumbents, and cross-border enforcement remains difficult.[6]

UK AND US APPROACHES – STATUTORY VS. FAIR USE

Through Section 29A of the Copyright, Designs and Patents Act 1988 (CDPA), the United Kingdom created a TDM exception that permits a person with lawful access to a work to make copies for computational analysis for the sole purpose of non-commercial research, and renders contract terms purporting to exclude the exception unenforceable.[7] The UK provision is narrower than the EU’s Article 4, but by giving lawful users an explicit statutory entitlement for research, it offers greater certainty within its scope.[8]

The United States, by contrast, has relied on fair use as a flexible, context-dependent doctrine. Landmark cases such as Authors Guild v. Google and Authors Guild v. HathiTrust upheld large-scale scanning and indexing as fair use because the copies served a transformative public purpose (searchability, knowledge access) rather than substituting for the originals.[9] The American approach is adaptable but unpredictable: fair use determinations require weighing multiple factors and are expensive to litigate. The trade-offs are evident: the U.S. fair-use route preserves doctrinal flexibility but is very costly for defendants and researchers to litigate or settle, while the UK/EU statutory route offers greater certainty but may rigidify into an industry-favourable framework that privileges licensing.[10]

INDIA’S LEGAL FRAMEWORK: SECTION 14 VS SECTION 52 — A GAP FOR TDM

Section 14 of the Indian Copyright Act, 1957 grants the author a bundle of exclusive rights, including the rights to reproduce the work and to store it in any medium by electronic means.[11] Section 52 of the Act provides exceptions, among them fair dealing for “private or personal use, including research” and other specified public interest uses.[12] Section 52, however, is framed with human, individualised purposes in mind (private study, criticism, reporting), excludes computer programs from some sub-clauses, and was never intended to contemplate large-scale automated copying and storage for machine learning.[13]

As a result, TDM in India faces two textual challenges. First, the literal fit: the full copying and long-term storage that TDM requires fall squarely within the reproduction right in Section 14 and are thus prima facie infringements unless an exception applies. Second, the doctrinal fit: the “fair dealing” exception in Section 52 has typically received a restrictive construction in Indian case law and is geared towards human-scale purposes (such as private study and criticism), so there is a substantial question whether large-scale automated copying for AI research would qualify as “fair dealing.” This generates a chilling effect on research and commercial development, since institutions must choose between expensive licensing and legal risk.[14]

CASE LAW AND DOCTRINAL SIGNALS

No Indian appellate court has directly addressed whether large-scale TDM for AI research is permissible without a licence. The Indian judiciary has repeatedly emphasised the exclusivity of reproduction rights and the limited, enumerated scope of the Section 52 exceptions.[15] This doctrinal posture of robust textual protection of reproduction rights combined with a narrow interpretation of exceptions suggests a reluctance to approve widespread TDM without legislative action. International precedent is instructive. The Second Circuit’s rulings in Authors Guild v. Google and HathiTrust recognised the transformative public purpose of scanning and indexing, which, where non-commercial and research-focused rather than a market substitute, is readily compared to TDM.[16] The EU, by contrast, codified a statutory version of this balance in the DSM Directive, and EU courts have also grappled with temporary-copy rules and the technological compatibility of exceptions.[17] Taken together, these precedents suggest that India might adopt either a court-driven fair use rule (as in the US) or a rights-aware statutory exception (as in the EU/UK). Each route has benefits and drawbacks.

CRITICAL THINKING: SHOULD AI TRAINING BE CONSIDERED INFRINGEMENT OR FAIR USE?

This is where the rubber meets the road. Three essential arguments suggest that a “blanket infringement unless licensed” approach would be undesirable.

(A) Market power and public goods failure. If only large incumbents have the means to license and clear massive datasets, AI development will be the preserve of a select few. Copyright law should not be used to accelerate data monopolies by enclosing the commons and raising the cost of public interest research.[18]

(B) TDM’s functional characteristics. The features, embeddings, and model weights produced by TDM are not exact replicas of the original content. They are functional transformations that are (1) heavily abstracted from the originals in a technical sense and (2) non-substitutive (the end user does not access the source material through the model). Treating every TDM copy as equivalent to commercial redistribution misstates the extent of the harm. This supports a use-based exception for non-expressive, non-substitutive training.[19]
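
The non-substitutive point can be made concrete: a typical feature representation is a fixed-length numeric vector from which the original wording cannot be read back. The toy “embedding” below (a hypothetical hashed bag-of-words, not any real model’s representation) maps arbitrarily long text to eight numbers; many different texts collapse to the same vector, so the representation is lossy and cannot serve as a market substitute for the work itself.

```python
def toy_embedding(text: str, dims: int = 8) -> list[int]:
    """Hash each token into one of `dims` buckets and count hits.
    The mapping is lossy and many-to-one: the source text cannot
    be recovered from the resulting vector."""
    vec = [0] * dims
    for token in text.lower().split():
        # Simple deterministic hash (sum of code points) for illustration.
        bucket = sum(ord(c) for c in token) % dims
        vec[bucket] += 1
    return vec

v = toy_embedding("a long copyrighted passage reduced to numbers")
print(len(v))  # always 8, regardless of how long the input text is
# Distinct texts can produce identical vectors -- the mapping is not invertible:
print(toy_embedding("ab") == toy_embedding("ba"))  # True
```

Real embedding models are of course far richer than this sketch, and memorisation of training data remains a live technical debate; the point here is only that a feature vector is structurally different from a copy offered to readers.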

(C) Innovation and Knowledge Access. If copyright laws impede machine-assisted knowledge discovery, the historical grounds for copyright—promoting learning and innovation—are undermined. To enable translation, summarisation, and useful AI, a careful balance must be struck between preserving public access to model training and protecting market incentives for creative reuse.[20]

In light of these criteria, a fair use-style analysis emphasising transformative purpose, non-substitutive use, scope, and commercial character may apply. However, legislation is preferable, because relying on fair use in litigation is costly and risky. India needs a specific TDM exception that distinguishes between scientific research, non-commercial civil society use, and open commercial model training that respects legitimate rightsholder interests (such as opt-outs or compensation schemes).[21]

POLICY OPTIONS FOR INDIA: ASSESSING RIVAL MODELS AND TRADE-OFFS

The legal ambiguity around text and data mining for AI research in India can be addressed by a number of regulatory solutions, each with distinct advantages and disadvantages. One frequently proposed idea is a statutory mandatory research exception, as stipulated in Article 3 of the EU’s DSM Directive. Research organisations and cultural heritage institutions would be able to use text and data mining for scientific study without first obtaining permission from rightsholders. This model’s main advantage is that it protects public interest research from transactional and legal barriers by giving publicly supported universities, libraries, and research institutes legal certainty. Its scope, however, may be quite limited unless further licensing mechanisms are created, potentially excluding startups, individual researchers, and commercial innovators. The model might therefore unintentionally entrench a rigid separation between the public and private sectors, one that no longer reflects how AI research is actually conducted.

A broader option would be a general user exception with a machine-readable opt-out system, modelled on Article 4 of the DSM Directive (and resembling the expanded TDM exception once proposed, though ultimately shelved, in the UK). Under this approach, every lawful user would be free to conduct text and data mining unless the rightsholder has explicitly reserved their rights, usually through technical means such as metadata or machine-readable flags.

This system is more inclusive and better reflects the growing trend of innovation occurring outside academic institutions, while still honouring the rightsholder’s freedom to withhold works from commercial mining. It has problems of its own, however. Opt-out systems are likely to favour major incumbents with the technical capacity to enforce reservations, while smaller creators may lack the ability to exert such control. The system’s success would also depend on the creation of efficient, discoverable opt-out mechanisms; otherwise, uncertainty would persist.

A third option is the judicial fair use/fair dealing method, comparable to the US system. Under this framework, courts would decide case by case whether AI training is lawful, considering its purpose, transformative character, market impact, and public benefit. Flexibility is this system’s main advantage: courts can apply copyright rules to new technological realities without legislative amendment. But that flexibility comes at a steep cost. Fair use litigation is expensive, time-consuming, and uncertain, which disadvantages smaller players. In the Indian environment, where the fair dealing exceptions are enumerated rather than open-ended, the judicial approach alone may not suffice.

Another recommendation is a licence-back or collective compensation scheme, sometimes styled a Copyright Clearance and Remuneration Authority for Training (CRCAT). AI developers would be permitted to mine copyrighted content in exchange for payments into a centralised licensing or compensation programme, which would then distribute the appropriate sums to copyright holders. This strategy aims to preserve financial incentives for creators while reducing transaction costs and avoiding the difficulties of individual licensing. It is not without governance problems, however: collective management schemes can be opaque, vulnerable to capture by powerful industrial interests, and poor at distributing compensation, particularly to marginal creators.

Lastly, legislators have proposed requiring transparency and labelling of AI training datasets and outputs. By mandating disclosure of training data, outputs, or the use of copyrighted content, this approach targets accountability rather than copyright liability itself. Transparency requirements can build public trust and ease downstream accountability, but they do not solve the clearance problem on their own. Without a meaningful exception or defence, transparency might simply expose liability without offering a legitimate path to compliance.

Taken together, these models show that no single approach adequately balances authors’ rights, innovation, and access. India’s best chance is a hybrid strategy that combines a mandatory research exception, a lawful-user text and data mining right with a specific and discoverable opt-out procedure, governance protections for any collective licensing scheme, and targeted transparency requirements for high-risk AI applications. This hybrid approach would bring copyright law into line with technological change without overprotecting or underprotecting creative works.[22]

PRACTICAL INTERIM MEASURES FOR COURTS AND POLICYMAKERS

Even in the absence of immediate legislative action, courts and policymakers can adopt workable interim measures to mitigate the most urgent harms of the current legal ambiguity surrounding text and data mining. One option is to treat non-expressive, non-substitutive text and data mining for legitimate research purposes as presumptively lawful, subject to appropriate safeguards. Such an approach would enable courts to draw a boundary between uses that are merely extractive for training purposes and those that compete directly with the original market for the work.

Additionally, rightsholders might be required to express any desired limitations on text and data mining through standardised, machine-readable opt-out signals and transparent, equitable licensing terms. This would promote openness in licensing markets and help prevent opportunistic enforcement. Simultaneously, the State could take a proactive stance by promoting voluntary open data projects, particularly the creation of public corpora for AI training. By seeding the market with high-quality, publicly accessible datasets, this would reduce the risk of data monopolies and ease entry for smaller innovators.

Lastly, for large-scale or high-impact AI models, legislators should work towards graduated transparency requirements that include provenance information and descriptions of the underlying datasets. Though no substitute for substantive copyright reform, these measures would boost accountability and enable well-informed regulatory decision-making. Together, they can ease the tension and unfairness that currently mark the AI training data market while India crafts a forward-looking legislative solution.[23]

CONCLUSION – RESISTING A COPYRIGHT-FUELED KNOWLEDGE ENCLOSURE

India is at a turning point. Because of TDM’s technological significance for AI, copyright law will significantly shape how AI markets develop. A passive enforcement posture that treats all TDM as infringement unless licensed could produce a knowledge enclosure of what was formerly a public commons and lock in incumbents. Yet an overly permissive exception could undercompensate rightholders and discourage investment. The best way to balance these competing considerations is an internally consistent, well-calibrated package: a legislative research exception, a discoverable opt-out for commercial uses, and governance requirements for any centralised licensing structure. Indian copyright law should enable an open, pluralistic AI ecosystem that promotes creativity and treats creators with respect and dignity.

Reference(S):

[1] On TDM technical process and reproduction issues, see generally Reed Smith, Text and Data Mining in the EU (Feb. 5, 2024).

[2] On economic concentration and data enclosure risks, see empirical and policy commentary (collected analysis). See Dev Gangjee, Relocating the Anti-Dilution Debate, 12 J. World Intell. Prop. 1 (2009) (analogy re: enclosure dynamics).

[3] Directive (EU) 2019/790, Arts. 3–4 (Text & Data Mining exceptions for research and lawful users).

[4] Id. Art. 3(2)–(3); see also WIPO coverage of the DSM Directive.

[5] Directive 2019/790, Art. 7(1) (contractual overrides unenforceable).

[6] On opt-out and enforcement challenges, see Reed Smith commentary on EU TDM.

[7] Copyright, Designs and Patents Act 1988, § 29A (U.K.) (statutory TDM exception for lawful users).

[8] See UK commentary on Section 29A and its scope (legal briefings).

[9] Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015); Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014) (US fair use decisions validating scanning/indexing).

[10] On tradeoffs between statutory clarity and fair use flexibility, see scholarship comparing US/EU approaches.

[11] Copyright Act, No. 14 of 1957, § 14 (meaning of copyright; reproduction and storage rights).

[12] Copyright Act, 1957, § 52 (exceptions to infringement; fair dealing for private use and research).

[13] On the limits of Section 52 and its original drafting context, see doctrinal analyses and legal commentaries.

[14] On the chilling effect of uncertainty in India for TDM/AI training, see contemporary commentary and working papers critiquing India’s approach to AI copyright reform.

[15] See analyses of Indian jurisprudence construing Section 52 (collecting cases).

[16] Authors Guild v. Google, 804 F.3d at 216–24; HathiTrust, 755 F.3d at 91–99 (transformative purpose and public benefit).

[17] CJEU jurisprudence and EU Directive context on temporary copies and exceptions; see DSM Directive discussions.

[18] On market power and enclosure in data economies, see policy reports and scholarly critiques (selected sources).

[19] On the functional nature of embeddings and technical transformations, see AI/data studies and legal commentary on non-substitutive uses.

[20] Policy argument: copyright as incentive for learning; see legislative history and policy literature.

[21] Comparative proposals and best practices for TDM exceptions and opt-out flags; see EU DSM Directive and UK reforms.

[22] On hybrid policy packages (exception + opt-out + collective) and governance design, see recent Indian policy commentary.

[23] Interim safeguards and transparency suggestions; see academic proposals and EU policy materials.
