RESEARCH ARTICLE

Using web content analysis to create innovation
indicators—What do we really measure?

Mikaël Héroux-Vaillancourt, Catherine Beaudry, and Constant Rietsch

Canada Research Chair on the Creation, Development and Commercialization of Innovation, Department of Mathematics and
Industrial Engineering, Polytechnique Montréal, P.O. Box 6079, Downtown office Montreal, Quebec, H3C 3A7, Canada

Keywords: construct validity, innovation measurement, multitraits multimethods, web content
analysis, web-mining, word frequency analysis

ABSTRACT

This study explores the use of web content analysis to build innovation indicators from the
complete texts of 79 corporate websites of Canadian nanotechnology and advanced materials
firms. Indicators of four core concepts (R&D, IP protection, collaboration, and external financing)
of the innovation process were built using keywords frequency analysis. These web-based
indicators were validated using several indicators built from a classic questionnaire-based survey
with the following methods: correlation analysis, multitraits multimethods (MTMM) matrices, and
confirmatory factor analysis (CFA). The results suggest that formative indices built with the
questionnaire and web-based indicators measure the same concept, which is not the case when
considering the items from the questionnaire separately. Web-based indicators can act either as
complements to direct measures or as substitutes for broader measures, notably the importance of
R&D and the importance of IP protection, which are normally measured using conventional
methods, such as government administrative data or questionnaire-based surveys.

1. INTRODUCTION

The majority of researchers in innovation and technology management today still rely on public
databases and questionnaire-based surveys to obtain most of their data to perform quantitative
studies on industrial strategies and innovation activities. However, public databases are often
incomplete or too general in nature. Although questionnaires remain precise instruments, the
process of designing, testing, and administering questionnaire-based surveys can be especially
time-consuming and costly for researchers. Moreover, oversolicitation of respondents and of
their time militates against questionnaire-based surveys, which suffer from increasingly lower
response rates (less than 10%) and thus threaten the validity of studies performed using such
methods, for instance because of the potential for significant nonresponse biases.

To complement questionnaire-based data, social scientists have often used secondary data.
With the development of “Big Data” analytical tools, websites are increasingly recognized as
additional information gold mines. Researchers in innovation and technology management are
now investigating whether they can use the information that organizations provide on their
websites to acquire valuable additional data for their research. Technology companies generally
maintain websites that allow the media, as well as potential investors, customers, suppliers, and
collaborators, to learn about the nature of their activities. The information provided on company
websites is as rich as it is diversified, including products, services, business models, R&D
activities, and more. As these corporate websites are freely available to anyone with internet access,

an open access journal

Citation: Héroux-Vaillancourt, M.,
Beaudry, C., & Rietsch, C. (2020). Using
web content analysis to create
innovation indicators—What do we
really measure? Quantitative Science
Studies, 1(4), 1601–1637.
https://doi.org/10.1162/qss_a_00086

DOI:
https://doi.org/10.1162/qss_a_00086

Received: 26 March 2019
Accepted: 28 May 2020

Corresponding Author:
Mikaël Héroux-Vaillancourt
mikael.heroux-vaillancourt@polymtl.ca

Handling Editor:
Vincent Larivière

Copyright: © 2020 Mikaël Héroux-
Vaillancourt, Catherine Beaudry, and
Constant Rietsch. Published under a
Creative Commons Attribution 4.0
International (CC BY 4.0) license.

The MIT Press

Using web content analysis to create innovation indicators

researchers need to evaluate their value as data sources. Before fully validating their usage, a few
questions need to be answered: Is it possible to extract the information contained on these web-
sites and convert it into useful data for research purposes? Is the available information reliable and
sufficient to provide an accurate picture of the firm’s specific characteristics? In other words, can
the contents of a corporate website be used to identify different business innovation attributes?

This research aims to study high technologies that should have emerged by now, nanotech-
nologies and advanced materials, but that are still difficult to assess because of their pervasiveness
throughout the industry. Nanotechnologies and advanced materials, as so-called enabling
technologies, have innumerable applications and are present in all major industrial sectors, such as
food, agriculture, electronics, renewable energy, the environment, biomedical, and healthcare.
The ability to develop and use nanotechnologies and advanced materials is expected to give im-
petus to innovation performance and economic growth (Hwang, 2010). The versatility of these
advanced technologies’ applications and the sectors affected by them make them particularly
interesting for studies in innovation management because they allow an intersectoral bird’s-eye
view of the industry. However, owing to the pervasiveness of nanotechnologies and
advanced materials throughout the economy, national statistics and other data repositories often
fail to accurately measure their impact. A potential solution to alleviate this perceived lack of
accurate data lies in the information contained on company websites.

The clear majority of companies working in high-technology fields, such as those producing or
using nanotechnology and advanced materials, maintain updated websites. Although the online
information is made available by the companies themselves, which suggests the possibility of a
strong self-reporting bias, this source of information is deemed suitable for the study of the emerging
technologies (Gök, Waterworth, & Shapira, 2015). For instance, Youtie, Hicks, et al. (2012) noted
that small businesses tend to have less content-heavy websites, which facilitates the handling of
data. A successful web mining analysis has several advantages over questionnaires, scientific
publications, and patents. First, the population covered by searching the web (web crawling) is very
large (Herrouz, Khentout, & Djoudi, 2013) compared to questionnaire-based studies, which
generate few returns (low response rate). Contrary to government data, the frequency of updates
is high, even daily, in most cases (Gök et al., 2015). One significant drawback, however, is that
companies do not disclose all of their strategic and business data on their websites. The main
disadvantage of such web-based data stems from problems inherent in organizing and interpreting
highly unstructured information, as each site organizes various types of information differently. In
addition, we do not yet know exactly what the significance is of what we are measuring, hence the
necessity to concurrently use both traditional and webometrics methods and to compare them with
one another.

In this study, we analyzed and compared four sets of measures of innovation and
commercialization respecting nanotechnology and advanced materials firms in Canada stemming from
two different data gathering techniques: word frequency analysis from web content and
questionnaire-based surveys. Comparisons between the results from both methods were obtained
via correlations. To ensure a convergent and discriminant validation of our results, a multitraits
multimethods (MTMM) technique was then performed on the most significant measures.
Formative indices were built to determine the best representation of the web-based indicators.
One final MTMM matrix was used, along with an MTMM confirmatory factor analysis (CFA),
as a post hoc analysis to ensure the robustness of our results.

The remainder of the article is organized as follows: Section 2 presents the theoretical
development of the innovation and commercialization of high technologies, such as nanotechnology
and advanced materials, along with web content analysis; section 3 describes the methodologies
used to collect data and the MTMM method; section 4 analyzes the results surrounding
propositions 1 and 2, and performs a post hoc analysis; section 5 discusses the results and conclusions;
and lastly, section 6 addresses the research limitations and future research.

2. THEORETICAL DEVELOPMENT

2.1. Using the Web as a Data Source for Social Sciences

A few years after the invention of the world wide web, its use as a means to get valuable data for
social science studies was introduced (Almind & Ingwersen, 1997). With more and more online
content from all aspects of society being produced, the web becomes a digital representation of
the social construct that embodies our civilization. With such a rich repertoire of information,
social scientists started to explore ways to leverage the web as a data source for research
(Björneborn & Ingwersen, 2004). As a result, measuring different elements of the web, such as
websites, webpages, parts of webpages, words in webpages, hyperlinks, and web search engine
results, has been undertaken by a number of scholars, paving the way to a new field of research:
webometrics, as an evolution from the classic bibliometric studies (Thelwall, 2009).

Link analysis, blog searching, and web impact assessment were the main focus of late 2000s
webometrics research. The first, link analysis, consists in analyzing the network of hyperlinks
contained in web pages in the hope that they may reflect an organization’s connections (Hyun
Kim, 2012; Katz & Cothey, 2006; Vaughan, 2004). This method has proven useful to understand
the structure of the collaborative network and the importance of a given actor within its network
(Minguillo & Thelwall, 2012; Stuart & Thelwall, 2006). The interest in the second method, explor-
ing the content of blogs, resides in their vast heterogeneous public-generated information at a
given time. Their understanding provides information on the trends in public opinion on a given
set of topics. Nowadays, blog searching is more prevalent in social media platforms such as
Twitter (Thelwall, Buckley, & Paltoglou, 2011), and the importance of topics can be effectively
monitored through Google Trends (Choi & Varian, 2012). The third, web impact assessment,
offers a method for measuring the direct online impact of specific topics or documents by
counting their presence on the web. These could be a good proxy for offline impact due to the
dominant penetration of the internet as a means of information and communication (Thelwall,
2009). Web impact assessment could help identify patterns of diffusion and relative impact of
authors, terms, scientific theories, political candidates, books, journals, and so on. In the case
of innovation studies, the web could be viewed as a way to obtain indirect indicators, such as
the number of attendees at a given event, the media coverage for a given product launch, or the
number of requests for an organization’s literature or the number of times an organization’s
concepts or publications are mentioned (Thelwall, 2009).

To perform web analysis, three main areas of web mining are generally used: web structure
mining for link analysis; both web usage mining and web content mining for web impact
assessment; and blog searching (Miner, Elder, et al., 2012). Of particular interest for this paper is web
content mining, which is a technique that turns unstructured information in the form of text con-
tained in web pages into structured data useful for research purposes.
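
As a minimal illustration of that step, the following Python sketch turns one HTML page into word counts using only the standard library. The class name, the function name, the tokenization rule, and the sample page are all assumptions made for this example; they are not the extraction pipeline used in this study.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_to_word_counts(html):
    """Turn one HTML page into a Counter of lowercase word tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Illustrative tokenizer: letters plus &, ' and - so that "r&d" survives.
    tokens = re.findall(r"[a-z][a-z&'-]*", text)
    return Counter(tokens)


# Hypothetical page content, for illustration only.
sample_html = """
<html><head><style>p {color: red;}</style></head>
<body><h1>Our Research</h1>
<p>We patent new materials. Research drives our products.</p>
<script>var research = 1;</script></body></html>
"""

counts = page_to_word_counts(sample_html)
print(counts["research"], counts["patent"])
```

The resulting counter is the kind of structured record (one row of word frequencies per page or per site) on which a keyword-based indicator can then be computed.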

Content analysis is an objective and quantitative method used to find relationships between
textual information and the context from which the information is sourced (Krippendorff, 1980).
The benefits and challenges of using the web as a data source for content analysis were already
discussed at the beginning of the millennium (Weare & Lin, 2000). The main opportunity from
using the web as a data source stems from the explosion of possible sources of multimedia
information (text, video, audio, and image) that can be used for content analysis, which was
anticipated to reduce the cost of data acquisition significantly. However, the information contained on
the web is highly unstructured, completely decentralized, and not standardized in any way,
which makes the quality and validity of web content analysis particularly challenging to verify
(Weare & Lin, 2000) and somewhat costly to process.

Despite the limitations, some web content analyses were performed in innovation studies. For
example, Herrouz et al. (2013) used a web scraping method to obtain data from small and
medium-sized high-technology graphene firm websites in the United States, the United Kingdom,
and China. More recently, Gök et al. (2015) proposed a web content analysis based on a
keywords frequency analysis to assess the R&D activities of 296 UK-based green goods small
and midsize enterprises (SMEs). They compared the results of their R&D web indicator with
the results obtained through a questionnaire-based survey. The results of this study showed that
the web-based indicator did not correlate significantly with the nonweb-based indicator, which
suggests that these two indicators did not reflect the same concept. It is therefore possible that
web-mining indicators provide new information that was not captured through classical method-
ologies and thus are shown to be complementary.

2.2. Construct Definition

Following in the footsteps of Gök et al. (2015), in this study we focus on parameters influencing
innovation and commercialization in Canadian high-technology firms. We consider the four most
important factors known to influence innovation and commercialization in high-tech firms, in our
case, nanotechnologies and advanced materials: research & development (R&D) intensity,
intellectual property (IP) protection, collaboration, and external financing (Lee, Lee, et al., 2013).

R&D intensity refers to the effort a firm makes to generate inventions (Griliches, 1990, 1994,
1998; Hausman, Hall, & Griliches, 1984; Hitt, Hoskisson, & Kim, 1997) as well as investment in its
own absorptive capacity (Cohen & Levinthal, 1990). Greater R&D efforts are likely to yield better
returns to innovation. Indeed, the positive relationship between R&D efforts and innovation
performance has been demonstrated in several quantitative studies (Baysinger & Hoskisson,
1989; Deeds, 2001; Greve, 2003; Griliches, 1998; Hagedoorn & Cloodt, 2003; Hall, 1990;
Parthasarthy & Hammond, 2002). Innovation and R&D efforts have been shown to positively
influence firms’ commercialization and financial performance (Geroski, Machin, & Van
Reenen, 1993; Klette, Møen, & Griliches, 2000). R&D efforts should therefore give
nanotechnology and advanced materials firms a technological superiority as compared to the market.
R&D intensity is a measure of intention, and, as such, not a direct measurement of the
performance of the R&D process as concerns the production of knowledge, inventions, and
innovations (Adams, Bessant, & Phelps, 2006; Cebon, Newton, & Noble, 1999; Flor & Oltra, 2004;
Kleinknecht, Van Montfort, & Brouwer, 2002). Furthermore, not all innovations are systematically
derived from this internal R&D process (Michie, 1998), as highlighted by the rising popularity of
open innovation practices (Chesbrough, 2003). Moreover, measures based solely on R&D are
not the best suited for SMEs because their efforts are often neither formal (Kleinknecht et al.,
2002) nor constant (Michie, 1998). For these reasons, researchers use R&D intensity less as a
proxy for innovation, as suggested by Becheikh, Landry, and Amara (2006), and more as a means
to achieve higher innovation performance.

IP protection confers a competitive advantage to companies by offering exclusivity for the
commercialization of technologies, branding, copyright, industrial design, etc. (Teece,
1986). Patents enhance the return on investment of the amounts dedicated to technology
development through temporary monopoly power, license granting, or the sale of patents (Arora, 1995;
Chesbrough, 2003; Feldman & Florida, 1994; Fosfuri, 2006; Mazzoleni & Nelson, 1998; Merges,
1999; Rivette & Kline, 2000). Laursen and Salter (2006) observed a positive relationship between
appropriability and innovation performance, but one that eventually leads to decreasing returns.
A patent, as one of the outputs of the research process, is often considered as a measure of inven-
tiveness (Coombs, Narandren, & Richards, 1996; Flor & Oltra, 2004; OECD & Statistical Office of
the European Communities, 2005). Although patent statistics are frequently used as proxies for
innovative activities (Pavitt, 1985) or innovation (Becheikh et al., 2006), they have also been the
subject of considerable criticism (Archibugi, 1992; Cohen & Levin, 1989; Dosi, 1988; Griliches,
1998). The use of patents varies considerably from one sector to another (Archibugi & Sirilli,
2001; Armellini, Beaudry, & Kaminski, 2017; Armellini, Kaminski, & Beaudry, 2014; Michie,
1998) and according to firm size, because SMEs, which have fewer financial resources to protect
their IP, are systematically disadvantaged.

In contrast to the first two types of indicators, collaboration has a much broader impact on the
innovation process: from idea generation and research to commercialization. Collaboration thus
has a positive impact on several innovation performance indicators, such as the number of
patents, sales growth, and the return on innovation sales (Arvanitis, 2012; Belderbos, Carree,
& Lokshin, 2004; Carboni, 2013), and has been shown to improve competitiveness
(Hagedoorn, Link, & Vonortas, 2000; Roja & Nastase, 2013). A constant collaborative effort is
essential for the development and deployment of emerging technologies, which are becoming
increasingly complex, as well as for reducing R&D costs, sharing risk, and increasing
performance (Johnson & Filippini, 2009; Parker, 2000). For example, the aerospace industry requires
all actors to adopt a policy of interorganizational cooperation to take advantage of all the
necessary knowledge and know-how available (Jordan & Lowe, 2004). In this field, collaboration
is so important that it spans all types of partners: customers (Armellini et al., 2017), including suppliers
(Bozdogan, Deyst, et al., 1998); competitors (Esposito, 2004; Frear & Metcalf, 1995); and
universities and institutes (Armellini et al., 2014, 2017). McNeil, Lowe, et al. (2007) have shown that
collaboration with universities or government institutes enables young high-tech firms to access
particularly expensive tools. Moreover, Kim, Lee, and Marschke (2014) highlighted the impact of
university research on scientists associated with the training and development of skilled labor,
patents, and innovation in industry.

Finally, as most nanotechnology and some advanced material projects are still in the early
development stages, they rely heavily on private and/or public funding to reach the
commercialization and marketing phases (Kalil, 2005; McNeil et al., 2007). In surveys on barriers to
innovation, firms very often refer to the lack of external financing as a major obstacle to their innovation
activities (Harhoff & Körting, 1998). Venture capital investment in young U.S. high-tech firms had
a positive and significant effect on their R&D productivity in the 1990s (Brown, Fazzari, &
Petersen, 2009). For more than a decade, substantial public investments have targeted
nanotechnologies worldwide (Crawley, 2007); more recently, in 2018, C$1.2 billion were invested in
the United States (National Nanotechnology Coordination Office, 2017). As the development of
these emerging technologies requires public support, the Innovation, Science and Economic
Development Canada (ISED, formerly known as Industry Canada) website proposes 13 programs
dedicated to funding the development of nanotechnologies and advanced materials. These public
funds often have a leverage effect: firms that receive such funding in grant form are more likely
to also successfully raise private funding and receive better external financing offers (Meuleman
& De Maeseneire, 2012).

These four pillars of the innovation process (R&D, IP, collaboration, and external financing)
have been the focus of much literature and have been extensively measured by both researchers
and national statistics offices. But do they transpire in corporate websites? A number of scholars
have attempted to analyze web content from the information contained on companies’ websites
to build indicators and innovation metrics (Gök et al., 2015; Herrouz et al., 2013). Yet
understanding the “real” meaning of measures created from data gathered with web mining
techniques is not a trivial task. Gök et al. (2015) showed that their web-based indicator based on
R&D-related keywords did not correlate significantly with nonweb-based indicators, which
suggested that these two indicators did not reflect the same concept. Therefore, the web mining
indicator could represent new information that was not captured through classical methodologies
and could possibly be complementary.

In this paper, two propositions will be tested to assess whether these web-based indicators can
be used as substitutes for indicators obtained by classical questionnaire-based methods.

Within corporate websites, innovation words and factors are referred to using several synonyms
and other related terms. Companies may use words on their website to provide insight into what
they actually do. The more a company uses terms related to a specific factor, the stronger the
signal is deemed for that factor, and therefore, the more likely the firm is to perform activities
related to that specific factor. Thus, for each factor mentioned above, we suggest the following
proposition:

Proposition 1: The more words related to a factor, in our case R&D intensity, intellectual
property, collaboration, and external financing, that are used on a firm’s website, the more a firm
is likely to perform activities related to that factor.

Connecting keywords found on websites with tangible actions taken by those companies is a bold
proposition that may not be valid empirically. The content of corporate websites is written with
the aim of signaling the company’s “best” characteristics to all its internal and external
stakeholders. This represents a rather modest but reasonable claim against the validity of the
information to be found, which may suffer from a possible self-reporting bias. Nevertheless, it is
reasonable to assume that the information provided on companies’ websites will be as true to the
company’s intentions as to the various concepts it wishes to share with the world. The information
that stands out on a website can give a broad sense of the importance of a specific factor for the
firm.

Proposition 2: The more words related to a factor, in our case R&D intensity, intellectual
property, collaboration, and external financing, that are used on a firm’s website, the more
important a firm considers that factor.

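
To make the logic behind the two propositions concrete, the following Python sketch computes one keyword-frequency score per factor for a block of website text. The keyword lists, the normalization per 1,000 words, and the sample text are hypothetical choices made for this illustration; the dictionaries and scaling actually used in the study are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical keyword lists, one per factor (illustrative only).
FACTOR_KEYWORDS = {
    "rd": ["research", "development", "laboratory", "r&d"],
    "ip": ["patent", "patented", "trademark", "license"],
    "collaboration": ["partner", "partnership", "collaboration", "alliance"],
    "financing": ["investor", "funding", "grant", "venture"],
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z][a-z&'-]*", text.lower())


def factor_indicators(text, per=1000):
    """Return keyword hits per `per` words for each factor.

    The normalization by text length is an assumption of this sketch,
    meant to keep long and short websites comparable.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    result = {}
    for factor, keywords in FACTOR_KEYWORDS.items():
        hits = sum(counts[k] for k in keywords)
        result[factor] = per * hits / total if total else 0.0
    return result


# Hypothetical website text, for illustration only.
website_text = (
    "Our laboratory leads research on coatings. We hold a patent on the "
    "process and work with a university partner. Research funding came "
    "from a federal grant."
)
scores = factor_indicators(website_text)
for factor, score in scores.items():
    print(factor, round(score, 1))
```

Under Proposition 1, a higher score for a factor would be read as a higher likelihood that the firm performs the related activities; under Proposition 2, as a sign that the firm attaches more importance to that factor.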
The two propositions will be tested and compared to the results obtained with a classic
questionnaire-based survey using the methodology explained in section 3.

3. DATA AND METHODOLOGY

3.1. Questionnaire-Based Data Collection

For a more general study of innovation in nanotechnologies and advanced materials, we first
conducted a classic questionnaire-based survey, the core of which is based on the Oslo Manual
(OECD & Statistical Office of the European Communities, 2005), to explore the following themes:
innovation, commercialization, collaboration, and IP protection, among other topics. The
questionnaire includes the importance of the sources of knowledge and the importance of
innovation activities measured by a Likert scale from not important (1) to essential (7). Other
questions cover the importance of commercialization actions, the proportion of financing for R&D
and commercialization, the importance of market impacts, the number of exports, the importance
of obstacles to commercialization, whether or not the firm collaborated to develop or
commercialize the latest most significant product innovation, the importance of collaborators,
the importance of the reasons for collaboration, IP mechanisms, patent management, and general
questions such as the number of employees, revenue, and the firm’s business sector. A sample of
the questionnaire is provided in Appendix 1.

Due to the ubiquitous nature of nanotechnology and advanced materials, combined with the fact
that they are adopted in a large number of industrial sectors, identifying companies that use or
develop these technologies is not straightforward.
Firms that either use or develop nanotechnology and advanced materials are not labeled as such,
nor are they searchable in any obvious way. Consequently, we used all possible means to build an
exhaustive list of high-tech companies with a higher probability of being involved in
nanotechnology or advanced materials with the help of several sources, including ISED,
Nano-Québec (now Prima-Québec), Nano Ontario, Nanowerk, futuremarketsinc.com, and AGY
Consulting, a Canadian firm specializing in emerging technologies such as nanotechnology, clean
technology, and biotechnology. Combining all these data sources resulted in more than a thousand
unique entries. After manually removing all obvious noneligible companies, a final list of 592
high-tech companies was then put together.

To maximize our response rate, we contracted Léger360 (now Léger), a well-known survey
company, to administer the questionnaire and find new companies eligible for our study. Their
professional approach is appreciated by firms that have responded to surveys in the past. Data
collection began in September 2016 based on a convenience sampling method. To cover even more
ground, we also used the snowball method. More precisely, respondents first had to answer the
eligibility question: “Does your company develop or/and commercialize nanotechnologies and/or
advanced materials?” Then, if the firms were eligible, the respondents were asked whether or not
they were interested in participating in our study. Finally, they were asked whether they had any
recommendations for any potential respondents who might be eligible for our investigation as an
attempt to increase our sample size. Firms that agreed to participate in the survey but did not
complete the questionnaire received two telephone reminders at 1-week intervals. After 332
solicitations, the incidence rate was 28%, which indicated that 93 companies were eligible.

To maximize our sample size, we added companies to the list by using the common charac- teristics of the eligible respondents. Companies that were eligible for the study (el 93 eligible companies listed above) were grouped and classified according to their North American Industry Classification System (NAICS) codes. Twenty-three six-digit NAICS codes corresponding to 67% of the eligible companies were identified. Léger360 acquired a list of 3,345 companies representative of the frequency distribution of the NAICS code obtained from InfoGroup. We solicited 2,971 Canadian high-tech companies through the entire data acquisition process. De estos, 973 companies did not respond, 1,459 were not eligible, 380 refused to participate, y 168 eligible companies agreed to participate, of which 89 respondents completed the survey. Because of Léger360’s code of ethics, we have no means of identifying whether respondents refused to participate before or after answering the eligibility questions. The unknown nature of the nonrespondents necessitates an approximation of the response rate. Assuming a uniform distribution, a 12% response rate was obtained by assigning the same characteristics to respondents who answered the eligibility questions to nonrespondents and respondents who refused to partic- ipate (ver tabla 1). As the population is unknown, we have a nonprobabilistic convenience sample for which the methodology may have induced a selection bias. Además, we assume that the respondents were honest and answered the survey with goodwill. The resulting sample represents a diverse selection of Canadian high-tech firms that develop or use nanotechnologies or advanced materials. Además, 74% of the firms are considered nano- technology or advanced materials intensive, which means that at least 80% of their revenues come from nanotechnology or advanced materials innovations. 
The different application domains are as follows: 54% in advanced materials, 21% in biotechnology and medicine, 24.4% in electronics, 23.3% in equipment and devices, 13.3% in photonics, and 33.3% in other domains. More than 50% of respondents are small businesses and 83.5% are SMEs. The $94 million revenue average decreases to $31 million when the three largest firms are excluded. Finally, 85% of the firms come from Quebec (54.5%) and Ontario (30.7%), and 12% are from British Columbia and Alberta.

Table 1. Estimated response rate

| Number of high-technology Canadian companies | Value |
|---|---|
| Reached (A) | 2,971 |
| Did not answer (B) | 973 |
| Refused to participate (C) | 380 |
| Unknown status (A − B − C = D) | 1,618 |
| Not eligible (E) | 1,459 |
| [Rate of not eligible (E/A = F)] | [49%] |
| Estimated not eligible (D × F = G) | 795 |
| Estimated eligible (A − E − G = H) | 718 |
| Total accepted | 168 |
| Total completed | 89 |
| [Estimated response rate] | [12%] |
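The arithmetic behind Table 1 can be retraced with a short Python sketch. The function and dictionary key names are illustrative and are not part of the study's tooling; the rounding step is an assumption, so intermediate values may differ from the table by one unit.

```python
def estimate_response_rate(reached, no_answer, refused, not_eligible, completed):
    """Follow the rows of Table 1 (letters A to H) step by step."""
    unknown = reached - no_answer - refused                    # D = A - B - C
    rate_not_eligible = not_eligible / reached                 # F = E / A
    # Rounding G to a whole company is an assumption of this sketch.
    est_not_eligible = round(unknown * rate_not_eligible)      # G = D x F
    est_eligible = reached - not_eligible - est_not_eligible   # H = A - E - G
    return {
        "unknown_status": unknown,
        "rate_not_eligible": rate_not_eligible,
        "estimated_not_eligible": est_not_eligible,
        "estimated_eligible": est_eligible,
        "response_rate": completed / est_eligible,
    }


result = estimate_response_rate(
    reached=2971, no_answer=973, refused=380, not_eligible=1459, completed=89
)
print(result)
```

With the figures reported above, the estimated response rate comes out at roughly 12%, whichever way the intermediate values are rounded.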

To test several types of biases, such as self-reporting bias and nonrespondent bias, we selected 79 eligible enterprises that did not participate in the study as a control sample. The Canadian government’s Directory of Canadian Companies¹ was used as an external source of data to rule out the possibility of nonresponse bias. The database of companies from different sectors comprises information provided by the companies themselves on a voluntary basis. Although the Canadian government does not guarantee the accuracy or the reliability of the content, we assumed that the companies that willingly update information in an official public database input accurate information. We therefore assume that this mitigates the self-reporting bias that might arise from this type of source. The database provides the number of employees for 37 firms and revenues for 30 firms from our main sample, as well as the number of employees for 29 firms and revenues for 26 firms from our control sample. Using a two-tailed Mann-Whitney U test, we compared the main sample with the control group for these two metrics. We did not find any significant difference (at the 5% level) between the two samples for either metric (p = 0.115 for the number of employees and p = 0.166 for revenues), which suggests the absence of a nonresponse bias: There are no significant differences between the characteristics of the respondents and the nonrespondents from our list and thus the sample of eligible companies is homogeneous.
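For readers who want to retrace this kind of comparison, the following is a minimal sketch of a two-sided Mann-Whitney U test in plain Python. It relies on the normal approximation with a tie correction and no continuity correction, which is an assumption of the sketch and not necessarily the procedure used in the study; the employee counts are hypothetical.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U, p), where U is the smaller of the two U statistics.
    The approximation is only reasonable when samples are not too small.
    """
    n1, n2 = len(x), len(y)
    # U for x: number of (x, y) pairs with x > y, ties counted as one half.
    u_x = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    u = min(u_x, n1 * n2 - u_x)
    n = n1 + n2
    # Tie correction: t is the size of each group of equal values.
    tie_term = sum(t**3 - t for t in Counter(list(x) + list(y)).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p value
    return u, p


# Hypothetical numbers of employees, for illustration only.
main_sample = [12, 25, 40, 8, 150, 60, 33]
control_sample = [15, 22, 45, 10, 90, 70, 28]
u_stat, p_value = mann_whitney_u(main_sample, control_sample)
print(u_stat, round(p_value, 3))
```

A p value above 0.05 would be read, as in the paragraph above, as the absence of a significant difference between the two samples at the 5% level.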

¹ https://www.canada.ca/en/services/business/research/directoriescanadiancompanies.html, accessed June 4, 2020.

We then compared the data obtained from our questionnaire-based survey with the data from the Directory of Canadian Companies using the same two metrics to verify whether an important self-reporting bias exists. For every firm for which we had data from both our questionnaire-based survey and the Directory, we tested each data pair with a two-tailed Wilcoxon signed-rank test. Once again, we did not find any significant difference (at the 5% level) between our questionnaire-based survey results and the data from the Directory (p = 0.058; p = 0.714), which suggests that the self-reporting bias issuing from the questionnaire-based survey is no different than that found in an official public database.
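The paired comparison can be sketched in the same spirit. The code below is a minimal Wilcoxon signed-rank test under stated assumptions: zero differences are dropped, tied magnitudes share their average rank, and the p value comes from the normal approximation. The paired values are hypothetical and the helper names are illustrative.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p), where W is the smaller of the positive and
    negative rank sums.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    magnitudes = [abs(d) for d in diffs]
    ranks = average_ranks(magnitudes)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, n * (n + 1) / 2 - w_plus)
    tie_term = sum(t**3 - t for t in Counter(magnitudes).values())
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - n * (n + 1) / 4) / math.sqrt(variance)
    return w, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical employee counts for the same six firms in both sources.
survey_values = [10, 20, 30, 40, 50, 60]
directory_values = [12, 18, 33, 37, 55, 54]
w_stat, p_value = wilcoxon_signed_rank(survey_values, directory_values)
print(w_stat, round(p_value, 3))
```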

To build the four factors described above (R&D, IP protection, collaboration, and external financing), we identified all the relevant questions from the questionnaire-based survey and transformed the answers to these questions into different types of variables. The 12 questions used are listed in Appendix 1. In addition, we transformed every continuous variable from the survey that did not follow a normal distribution by applying a natural logarithm (ln) or an inverse function (inv). Some Likert-scale questions could not be normalized because they were skewed on one tail or the other. We thus transformed them into dummy variables by attributing the value of 1 to answers associated with important (5), very important (6), and essential (7), and the value of 0 to the remaining answers.
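These transformations are simple enough to state as code. The sketch below uses illustrative function names; how zero or negative values were handled is not stated in the text, so the functions simply assume valid inputs.

```python
import math


def dichotomize_likert(score, cutoff=5):
    """Map a 7-point Likert answer to 1 (important or above) or 0."""
    if not 1 <= score <= 7:
        raise ValueError("expected a score between 1 and 7")
    return 1 if score >= cutoff else 0


def ln_transform(values):
    """Natural logarithm; assumes strictly positive values."""
    return [math.log(v) for v in values]


def inverse_transform(values):
    """Inverse function 1/x; assumes non-zero values."""
    return [1 / v for v in values]


answers = [2, 5, 7, 4, 6]  # hypothetical Likert answers
print([dichotomize_likert(a) for a in answers])  # [0, 1, 1, 0, 1]
print(ln_transform([1, 10, 100]))
print(inverse_transform([2, 4, 8]))  # [0.5, 0.25, 0.125]
```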

A principal component analysis (PCA) was then performed to validate reflective constructs. Only constructs with sufficiently high Kaiser-Meyer-Olkin (KMO) measures (KMO > 0.6) were deemed valid, and only dimensions with Cronbach’s alpha above 0.7 (Hair et al., 1998) were considered reliable. We used PCA with a Varimax rotation on seven-point Likert-scale questions that described the concept of R&D (R&D questions 2, 3, 5, and 6 in Appendix 1) to explore whether any particular combination of variables could lead to relevant reflective dimensions corresponding to specific factors of the concept examined. Two factors were created, but neither the KMO measure nor Cronbach’s alpha reached an acceptable level (KMO < 0.6, alpha < 0.7) to satisfy the validity and reliability of the constructs. In addition, these combined variables did not correlate with each other, which suggested using a formative construct. We thus proceeded to treat each item individually rather than as a composite indicator.

In the end, we generated nine variables pertinent to R&D, one variable relating to collaboration, two variables corresponding to external financing, and two variables measuring IP. Table 2 describes the details of the questionnaire-based survey constructs. Descriptive statistics are presented in Appendix 2.

Table 2. Questionnaire-based survey construct

| Concepts | Indicators | Variables |
|---|---|---|
| R&D | Number of R&D projects in nanotechnology/advanced materials | ln(nb_R&D_proj) (Continuous) |
| | Dichotomized importance of internal R&D as a source of knowledge | dInt_R&D_source (Dummy) |
| | Level of importance of commercial laboratories/R&D firms/technical consultants as a source of knowledge | Ext_R&D_source (Ordinal) |
| | Level of importance of contracting of external R&D service providers | Contract_In_R&D (Ordinal) |
| | Level of importance of providing R&D services to third parties | Contract_Out_R&D (Ordinal) |
| | Time of R&D | ln(time_R&D) (Continuous) |
| | Dichotomized importance of private research laboratories/research and development firms as collaborators for the development and the commercialization | dCollab_R&D (Dummy) |
| | Dichotomized importance of accessing research and development from collaborators for the development and the commercialization | dCollab_R&D_Reason (Dummy) |
| | Proportion of Canadian employees assigned primarily in R&D (%) | propEmp_R&D (Continuous) |
| IP | Number of IP mechanisms used | nbIP (Count data) |
| | Number of patents | ln(nbPatent) (Continuous) |
| Collaboration | Use of collaboration for the latest innovation | dCollab (Dummy) |
| External financing | Proportion of external financing for R&D (%) | PropExtFin_R&D (Continuous) |
| | Proportion of external financing for commercialization (%) | PropExtFin_Comm (Continuous) |
| | Total proportion of external financing (%) | propTotExtFin (Continuous) |

Note: All continuous and ordinal variables are normal.

3.2. Web-Based Data Collection

While we were processing the questionnaire-based survey data, we set up the indexing robot with Nutch², an open source web scraper software. We provided the robot with an initial list of uniform resource locators (URLs) corresponding to the 89 enterprises that compose our questionnaire-based data set. This list is the first level of indexing traversed by the robot, which then browses the structure of the entire website to extract all the URLs.

² We used Apache Nutch version 2.2.1. Documentation can be found at https://nutch.apache.org/apidocs/apidocs-2.2.1/index.html, accessed June 4, 2020.

To begin the data collection process, the URLs linked to a company’s website are written in a text file. The URLs are then injected into Nutch’s database. Once the database has been populated with the list of sites to be visited, the robot’s route is generated by indicating the maximum number of links per page that we want to collect. The robot ranks the pages with a score that it calculates to prioritize the text collection. For instance, a page will have a high score if it is pointed to by many other pages. In addition, the higher the scores of pages pointing to another page, the higher the score of the page pointed to. Nutch then selects the links of the pages with the highest score to be collected first. When the links to browse are generated, Nutch will collect all the HTML code from the pages as well as the links to the other pages.

There are mainly three different types of links that structure the Internet network. Upstream links start from a page to one or more pages not necessarily on the same site. Downstream links are all links pointing to the same page. Finally, vertical links are all links within the same website (Van de Lei & Cunningham, 2006). The indexing robot scans all links without distinguishing between domain names and ranks them by score to first scan the links it considers most interesting. We have therefore limited the robot to the vertical links of the domain names present in the initial list. This is possible by modifying the Nutch configuration file limiting the robot to the initial list of domains.

By default, Nutch indexes only the text of pages in hypertext markup language (HTML) format. However, relevant information was also stored in other digital formats. We used the Tika plug-in to browse and index the text of documents in portable document format (.pdf) and Microsoft Word (.doc and .docx) formats.

A further difficulty stems from the fact that our sample included companies that use multiple languages on their websites. Most Canadian companies will have English or French content and some will have other languages. Therefore, we took into account the fact that the information could be given in languages other than English. There were several different cases to deal with. If a website was entirely in English, which corresponds to a majority of Canadian company websites, we then kept all the data from the website.
If a website was available in several languages, including English, then only pages written in English were taken into account. If the website was exclusively available in French, the text was translated into English. Finally, the last case was a page whose language was neither French nor English. Such a page was not considered eligible for our study and, therefore, was not kept for the rest of the study.

Language detection was performed by the Compact Language Detector software and the translation was performed with Google’s programming interface, Goslate. Translation with Goslate was also used by Arora, Youtie, et al. (2013) to analyze the content of Chinese graphene-producing company websites. We assigned to each recovered page a label of language, “French,” “English,” or “other” and translated the French pages into English for the relevant sites according to our assumptions.

Due to technical limitations, such as the structure of the websites, only 79 of these firms (88%) provided enough information to be included in our study. In the end, more than 9.7 million words from 27,000 pages were extracted from the complete texts of corporate websites.

We then used a word frequency analysis on the text present in the websites. More specifically, for the 79 websites captured, we gathered information on the four innovation and commercialization factors mentioned above. For each factor, we listed all the relevant keywords that appear in corporate websites. Figure 1 shows the complete process to build the web-based innovation indicators to be compared with the questionnaire-based indicators described in the previous section.

Factors, keywords, and the web mining constructs are described in Table 3. R&D (Gök et al., 2015) and collaboration (Ramdani, 2014) keywords were selected from the literature, and IP and external financing were identified from our own investigations of the literature.
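A word frequency analysis of this kind can be illustrated with a few lines of Python. This is only a sketch and not the RapidMiner workflow used in the study: the matching rule (a keyword must start at a word boundary but may continue into a longer word) is an assumption chosen so that stems such as "consorti" behave as stems, and the sample text is hypothetical.

```python
import re


def count_keywords(text, keywords):
    """Count case-insensitive occurrences of each keyword or stem.

    A match must start at a word boundary but may continue into a longer
    word, so the stem "consorti" matches "consortium" and "consortia".
    """
    lowered = text.lower()
    counts = {}
    for keyword in keywords:
        pattern = r"(?<![a-z0-9])" + re.escape(keyword.lower())
        counts[keyword] = len(re.findall(pattern, lowered))
    return counts


page_text = (  # hypothetical website text
    "Our R&D team holds a patent on this coating. Patented processes "
    "and two patents pending. We are a member of an industry consortium."
)
print(count_keywords(page_text, ["r&d", "patent", "consorti"]))
# {'r&d': 1, 'patent': 3, 'consorti': 1}
```

Under this rule a short keyword such as "partner" would also be counted inside "partnership", so overlapping keywords need to be checked manually for double counting.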
The most relevant keywords of any paper, which are generally listed on its first page, particularly in the abstract³, serve as a basis for the list of keywords used for the construction of our factors. We used additional keywords related to specific public funds and programs. As mentioned earlier, Canada’s public programs and funding opportunities for companies for the development of nanotechnology and advanced materials projects are listed on the ISED website. Specific information on these funds and programs was converted into keywords tied to the external funding factor for the individual firms for which we extracted website data. Keywords were manually tested to ensure they did not lead to false positive results.

³ For this pilot/exploratory study, we did not mine entire articles to select the most common or important keywords related to specific contexts.

Clustering using keyword frequency analysis with RapidMiner, a text mining program, enabled us to count the number of occurrences of each keyword for each factor. We transformed these clusters of occurrences into four continuous variables. Because the 79 companies differ in structure and size, and therefore present different quantities of information on their websites, we standardized each variable by dividing all occurrences by the total number of words appearing on their website and multiplying the resulting value by 1,000. For each continuous variable, we calculated the kurtosis and skewness measures to determine whether they followed a normal distribution. Given that none of the four variables did so, they were transformed by applying a natural logarithm (ln) or an inverse function (inv). In the case of external financing, because we did not reach normality with either transformation, this variable was treated using nonparametric tests that do not require normality.
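The standardization and the normality screening described above can be sketched as follows. The moment-based skewness and excess kurtosis formulas are standard, but the cutoff used in `looks_normal` is a hypothetical rule of thumb, since the text does not report the thresholds applied; the site figures are invented for illustration.

```python
import math


def per_thousand(occurrences, total_words):
    """Keyword occurrences per 1,000 words of website text."""
    return occurrences * 1000 / total_words


def _central_moment(values, order):
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness (0 for a symmetric distribution)."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3


def looks_normal(values, limit=1.0):
    """Hypothetical rule of thumb: both measures within +/- limit."""
    return abs(skewness(values)) <= limit and abs(excess_kurtosis(values)) <= limit


# Hypothetical (keyword occurrences, total words) pairs for five websites.
sites = [(1, 10000), (2, 10000), (4, 10000), (8, 10000), (64, 10000)]
rates = [per_thousand(k, n) for k, n in sites]
logged = [math.log(r) for r in rates]
print(round(skewness(rates), 2), round(skewness(logged), 2))
```

In this invented example the logarithm reduces the right skew of the rates, which is the motivation for the ln transformation; it does not by itself establish normality.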
Figure 1. Data mining and treatment process.

Table 3. The web mining construct

| Factors | Keywords | Indicators | Variables |
|---|---|---|---|
| R&D | research & development, research and development, r&d, researcher, product development, technology development, technical development, development phase, development program, development process, development project, development cent, development facility, technological development, development effort, development cycle, development research, development activity, fundamental research, basic research (Gök et al., 2015) | Natural logarithm of the number of keyword occurrences divided by the total number of words, multiplied by 1,000 | ln(WEB_R&D) |
| IP | patent, intellectual property, trade secret, industrial design | Inverse (1/x) of the number of keyword occurrences divided by the total number of words, multiplied by 1,000 | inv(WEB_IP) |
| Collaboration | affiliation, collaboration, cooperation, partners, partnership, consorti, international consorti, global consorti | Natural logarithm of the number of keyword occurrences divided by the total number of words, multiplied by 1,000 | ln(WEB_Collab) |
| External financing | atlantic canada opportunities agency, business development bank of Canada, sustainable development technology, venture capital, atlantic innovation fund, nrc-irap, fednor, Industrial research assistance program, grants, private investment | The number of keyword occurrences divided by the total number of words, multiplied by 1,000 | WEB_ExtFin* |

* All transformed variables are normal except WEB_ExtFin, which could not be normalized by any transformation.

A possible selection bias arises from the fact that we selected only those companies that answered the survey. To verify whether this was the case, we ran our web mining program on the websites of the control sample and generated the same variables. We then used a two-tailed Student’s t test to test the difference in means for the following variables: ln(WEB_R&D) (p = 0.130), inv(WEB_IP) (p = 0.083), and ln(WEB_Collab) (p = 0.144), and concluded that the difference between the two samples is not significant (at the 5% level) for these three variables. Finally, using a two-tailed Mann-Whitney U test, we compared the means of the variable WEB_ExtFin (p = 0.008) from both our sample and the control sample and found a significant difference. For this particular variable, we cannot conclude that the means of the two samples are the same. A selection bias is therefore present for the variable WEB_ExtFin and will be included in the limitations of our research.

3.3. The Multitrait Multimethod (MTMM) Matrix Technique

The literature describes two types of biases that can threaten construct validation: (a) mono-method biases can occur when the by-products from the unique combination of a trait (in our case an innovation factor such as R&D, IP, collaboration, and external financing) and a measure introduce a systematic variance; and (b) by-products that are congeneric (i.e., inherent in the methods used) generate anomalies among the items forming the measures (Ortiz de Guinea, Titah, & Léger, 2013; Reinig, Briggs, & Nunamaker, 2007; Straub & Burton-Jones, 2007; Straub, Limayem, & Karahanna-Evaristo, 1995).

First introduced by Campbell and Fiske (1959), the MTMM is designed for the convergent and discriminant validation of a construct where a set of t traits (interchangeable with factors in our case) are measured with m different methods.
It has been suggested that this method is an effective way to verify the ability of new measurement methods to successfully measure what they are supposed to, while testing for the presence of mono-method biases in social science studies, such as psychology (Bar-Anan & Vianello, 2018; Guo, Aveyard, et al., 2008), education (Campbell, Michel, et al., 2019; Gulek, 1999), organizational research (Bagozzi, Yi, & Phillips, 1991), marketing research (Lugtig, 2017), and information systems (Ortiz de Guinea et al., 2013).

Even though its use for construct validity has been employed selectively since its introduction in 1959, the MTMM matrix is still one of the very few possible ways to attempt construct validation and this is why we have been witnessing a resurgence in its use in the past few years. To our knowledge, this validation method has not been used in innovation studies or to assess new innovation metrics compared to the traditional means of measuring innovation, as in the Oslo Manual (OECD & Statistical Office of the European Communities, 2005; OECD & Eurostat, 2019).

An example of an MTMM matrix with three traits (A, B, C) measured using three different methods is shown in Table 4. The matrix easily allows the visualization of four different means of assessing the quality of a measure: reliability, convergent validation, absence of either a combination of trait and method effects or a mono-method bias and, finally, discriminant validation (Campbell & Fiske, 1959).

The diagonal of the matrix (also called the reliability diagonal) comprises six Cronbach alpha values (αA1, αB1, αC1, αA2, αB2, and αC2), which measure the correlation of the traits with themselves and are thus an indication of the reliability of each measure for each trait (Campbell & Fiske, 1959).
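As a reminder of what the reliability diagonal contains, Cronbach's alpha for a multi-item measure can be computed as in the sketch below. It uses the usual formula based on item variances and the variance of the summed score; the data layout and the Likert answers are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    `items` is a list of k lists; each inner list holds one item's scores
    for the same respondents, in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical 7-point Likert answers of six respondents to three items.
items = [
    [5, 6, 3, 7, 4, 6],
    [4, 6, 3, 6, 5, 7],
    [5, 7, 2, 6, 4, 6],
]
print(round(cronbach_alpha(items), 3))
```

The formula requires at least two items, which is why no alpha can be reported for a trait measured by a single item.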
The mono-trait mono-method values on the reliability diagonal indicate the reliability of a reflective measure (Campbell & Fiske, 1959), which is required to ensure that a measure will produce the same result under the same conditions. High Cronbach’s alpha values on the matrix diagonal confirm the reliability of the traits; that is, that the variance of the measures behaves systematically (Churchill, 1979). In other words, any possible occurrence of a random error is minimized.

Table 4. The MTMM matrix technique

| Method | Traits | A1 (Method 1) | B1 (Method 1) | C1 (Method 1) | A2 (Method 2) | B2 (Method 2) | C2 (Method 2) |
|---|---|---|---|---|---|---|---|
| Method 1 | A1 | αA1 | rA1,B1 | rA1,C1 | rA1,A2 | rA1,B2 | rA1,C2 |
| | B1 | | αB1 | rB1,C1 | rB1,A2 | rB1,B2 | rB1,C2 |
| | C1 | | | αC1 | rC1,A2 | rC1,B2 | rC1,C2 |
| Method 2 | A2 | | | | αA2 | rA2,B2 | rA2,C2 |
| | B2 | | | | | αB2 | rB2,C2 |
| | C2 | | | | | | αC2 |

The diagonal at the cross-section of Method 1 and Method 2 (also called the mono-trait hetero-method, or validity diagonal) is comprised of three correlation values (rA1,A2, rB1,B2, and rC1,C2) that compare one trait measured with two different methods. High and significant values confirm the convergent validity of the methods (Campbell & Fiske, 1959); that is, that two different methods measure the same traits (Churchill, 1979).

Two triangles, called hetero-trait mono-method triangles, represent the cross-sections of traits that belong to the same method within the matrix. They are comprised of three correlation values per triangle. For instance, the hetero-trait mono-method triangle of Method 1 comprises rA1,B1, rA1,C1, and rB1,C1, and that of Method 2 comprises rA2,B2, rA2,C2, and rB2,C2 (Campbell & Fiske, 1959).
High correlation values (higher than the validity diagonal) within the hetero-trait mono-method triangle question the validity of the construct because of either a combination of trait and method effects or a mono-method bias.

Finally, two other triangles that represent the cross-sections of traits that belong to different methods within the matrix (also called hetero-trait hetero-method triangles) are comprised of three correlation values per triangle. The hetero-trait hetero-method triangle of Method 1 is comprised of rA1,B2, rA1,C2, and rB1,C2, and that of Method 2 is comprised of rB1,A2, rC1,A2, and rC1,B2 (Campbell & Fiske, 1959). Again, high values (higher than the validity diagonal) question the discriminant validity of the methods. Discriminant validity is used to assess whether a method avoids measuring something it is not supposed to measure (Peter & Churchill, 1986).

Bagozzi et al. (1991) have raised questions about the reasonability of the criteria needed, the lack of standard practice, the negligence of the amplitude of the differences between pairs of correlations in the analysis, and the lack of information about the nature of the variation in measures due to traits, method, or random error. Fiske and Campbell (1992) listed the exceptional conditions required to successfully benefit from the method. Bagozzi et al. (1991) suggested CFA as an alternative to the MTMM for construct validation, which, although it does not share the same limitations, is unable to differentiate the random error from the variance measure and unable to verify interactions between traits and methods. Therefore, performing both methods will mitigate the limitations inherent in either method taken individually and ensure robustness of the results. In this paper, we adopt the methodology used by Ortiz de Guinea et al. (2013), who opted for an MTMM analysis followed by a post hoc analysis with a CFA to validate their results.

4. RESULTS

4.1. Keywords Analysis

More than 9.7 million words from 27,000 pages were extracted from the complete texts of 79 corporate websites. The details of the frequency analysis for each indicator are provided in Table 5. For the R&D factor, 1,974 keyword occurrences were counted, of which six keywords (r&d, scientist, laboratory, research and development, researcher, laboratorie) account for more than 85% of the total frequency distribution. For IP, 75% of the total frequency distribution (397) is embodied by the keyword “patent,” hence suggesting that most of the signal provided by the IP indicator is actually related to patenting. This is hardly surprising considering our sample of nanotech firms. For collaboration, 1,769 keyword occurrences were counted, of which more than half are associated with partnering activities. On external financing, we only obtained 251 keyword occurrences, of which more than 80% of the frequency distribution was dominated by the word “grant.”

Table 5. Keyword frequencies

| R&D | Freq. | IP | Freq. | Collaboration | Freq. | External financing | Freq. |
|---|---|---|---|---|---|---|---|
| r&d | 480 | patent | 300 | partner | 967 | grant | 202 |
| scientist | 368 | intellectual propert | 84 | partnership | 422 | venture capital | 31 |
| laboratory | 274 | trade secret | 8 | collaboration | 132 | business development bank of canada | 7 |
| research and development | 246 | patent pending | 4 | alliance | 88 | industrial research assistance program | 4 |
| researcher | 185 | patent protect | 1 | collaborative | 62 | private investment | 4 |
| laboratorie | 165 | | | cooperation | 35 | atlantic canada opportunitie | 1 |
| product development | 99 | | | partnering | 34 | public fund | 1 |
| technology development | 64 | | | collaborating | 18 | public funding | 1 |
| research & development | 33 | | | cooperative | 9 | | |
| development phase | 15 | | | cooperating | 2 | | |
| development project | 12 | | | | | | |
| development effort | 10 | | | | | | |
| basic research | 5 | | | | | | |
| technological development | 4 | | | | | | |
| development program | 4 | | | | | | |
| fundamental research | 2 | | | | | | |
| development cycle | 2 | | | | | | |
| development center | 2 | | | | | | |
| development research | 1 | | | | | | |
| development process | 1 | | | | | | |
| development facility | 1 | | | | | | |
| development faciliti | 1 | | | | | | |
| Total | 1,974 | Total | 397 | Total | 1,769 | Total | 251 |
| Mean | 89.72 | Mean | 66.17 | Mean | 147.41 | Mean | 31.38 |
| Std | 138.23 | Std | 118.91 | Std | 283.20 | Std | 69.67 |

4.2. Testing Proposition 1

Construct validation

Each pair of variables related to the same concept from the two methods (web mining and questionnaire-based survey) was compared, using a Pearson correlation analysis when the variables followed a normal distribution, and a Spearman correlation when they did not, to assess whether the variables stemming from the web mining technique could be used as a proxy for similar concepts measured by a survey.
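The choice between the two coefficients can be made explicit in code. The following is a minimal sketch with illustrative function names and hypothetical data; Spearman's coefficient is computed here as the Pearson correlation of average ranks, which is the standard definition when ties are present.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlate(x, y, both_normal):
    """Pearson when both variables are treated as normal, else Spearman."""
    return pearson(x, y) if both_normal else spearman(x, y)


# Hypothetical web-based and questionnaire-based values for six firms.
web_indicator = [0.4, 1.2, 0.9, 2.5, 0.1, 1.8]
survey_indicator = [2, 5, 3, 7, 1, 4]
print(round(correlate(web_indicator, survey_indicator, both_normal=False), 3))
```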
Details of the construct comparison of the web-based indicators and the questionnaire-based indicators are provided in Table 6.

Table 6. Validation construct for testing proposition 1

| Concepts | Web-based variables | Questionnaire variables |
|---|---|---|
| R&D | ln(WEB_R&D) | ln(nb_R&D_proj), dInt_R&D_source, Ext_R&D_source, Contract_In_R&D, Contract_Out_R&D, ln(time_R&D), propEmp_R&D, dCollab_R&D, dCollab_R&D_Reason |
| IP | inv(WEB_IP) | nbIP, ln(nbPatent) |
| Collaboration | ln(WEB_Collab) | dCollab |
| External financing | WEB_ExtFin | propExtFin_R&D, propExtFin_Comm, propTotExtFin |

Correlation results

All correlation results between the web-based indicators and the questionnaire-based indicators are detailed in Appendix 3. For this study, the correlations were tested with a two-tailed test at the 5% level of significance. Our results show a correlation of 0.306 (p < 0.01) between the R&D web-based indicator and whether a firm is likely or not to provide R&D services to third parties (from the questionnaire). Additionally, we find a correlation of 0.306 (p < 0.01) when we associate the R&D web-based indicator with whether or not a firm has a high percentage of employees working on R&D tasks. Moreover, a correlation of 0.284 (p < 0.05) is observed between the R&D web-based indicator and whether or not a firm is likely to contract R&D services from external providers. Finally, we find a nonsignificant correlation of 0.197 (p = 0.100) when we associate the R&D web-based indicator with whether or not a firm has a long R&D process. The fact that the variable related to the number of R&D projects is not correlated with our R&D web-based indicator (r = 0.002; p > 0.1) strongly suggests that the latter does not properly account for the size dimension of R&D activities.

The IP web-based indicator strongly correlates with the variables from the survey regarding the use of IP protection mechanisms (r = 0.368; p < 0.01) and patent-related activities (r = 0.396; p = 0.5). The web content analysis method seems to be able to appropriately capture the importance of the use of IP mechanisms using our sample.

Relations between the web-mining and questionnaire-based indicators for the other two factors are not as strong. For instance, the collaboration web-based indicator is weakly correlated and nonsignificant at 5% (r = 0.222; p < 0.1) with the questionnaire-based indicator for the firms that are confirmed as having collaborated. Finally, the external financing web-based indicator is also weakly correlated and nonsignificant at 5% with the extent of the use of external funding for commercialization purposes (r = 0.222; p < 0.1).

Consequently, we cannot reach a definite conclusion regarding proposition 1, especially in the case of our web-based indicators for collaboration and external financing (a less constraining significance level of 10% would be required to accept the correlations). In the following paragraphs, we will restrict our analyses to the R&D and IP factors to test propositions 1 and 2.

MTMM results

All the variables used to measure the R&D and IP factors from the questionnaire-based survey that correlated significantly with our web-based indicators were selected. As explained in Section 3.3, the MTMM compares different combinations of methods that measure the same traits (factors). Thus, all possible combinations of traits and methods need to be tested. In the case of the web-based indicators, only one combination of trait and method is used. For the questionnaire-based indicators, three different variables are correlated with the R&D web-based indicator and two different variables are correlated with the IP web-based indicator. As such, six different combinations are possible and thus six different matrices are built.
To ease the reader’s comprehension of the interpretation of an MTMM matrix, Table 7 shows an example of how the results are displayed, along with a short analysis of all the different validity tests below the table. Furthermore, all traits are considered to be single items, which implies that we cannot calculate the reliability of the measure. Accordingly, the reliability diagonal will be neglected in our analysis.

In all matrices, we can observe that the hetero-trait mono-method value is low and not significant for the web-based method (rRD1,IP1 = −0.180; p-value > 0.05) and lower than all corresponding mono-trait hetero-method diagonal values (rRD1,RD2 and rIP1,IP2). This suggests that there is no combination of trait and method effects and that no method bias for the web-based method is present. The complete analysis of all the different matrices results is shown in Tables 8–13.
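The three validity checks applied to each matrix can be written down as a small decision procedure. The sketch below is one possible reading of the rules stated with Table 7, not the authors' code: "high and significant" is reduced to "positive and significant", which is an assumption, and the correlations and significance flags are those reported in Table 8.

```python
def mtmm_checks(r, significant):
    """Apply the three two-trait, two-method checks to a set of correlations.

    `r` maps pairs such as ("RD1", "RD2") to correlations, where suffix 1
    is the web method and suffix 2 the questionnaire method; `significant`
    is the set of pairs that are significant at the 5% level.
    """
    validity = [("RD1", "RD2"), ("IP1", "IP2")]       # mono-trait hetero-method
    hetero_hetero = [("RD1", "IP2"), ("IP1", "RD2")]  # hetero-trait hetero-method
    hetero_mono = [("RD1", "IP1"), ("RD2", "IP2")]    # hetero-trait mono-method
    weakest_validity = min(r[pair] for pair in validity)
    return {
        "convergent": all(p in significant and r[p] > 0 for p in validity),
        "discriminant": weakest_validity > max(abs(r[p]) for p in hetero_hetero),
        "no_method_effect": weakest_validity > max(abs(r[p]) for p in hetero_mono),
    }


# Correlations reported in Table 8 (Contract_In_R&D with nbIP).
table8 = {
    ("RD1", "IP1"): -0.180,
    ("RD1", "RD2"): 0.283,
    ("RD1", "IP2"): 0.007,
    ("IP1", "RD2"): 0.209,
    ("IP1", "IP2"): 0.368,
    ("RD2", "IP2"): 0.433,
}
table8_significant = {("RD1", "RD2"), ("IP1", "IP2"), ("RD2", "IP2")}
print(mtmm_checks(table8, table8_significant))
# {'convergent': True, 'discriminant': True, 'no_method_effect': False}
```

The failed third check reflects the fact that the two questionnaire items correlate more strongly with each other (0.433) than either does with its web-based counterpart.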

Of the six MTMM matrices produced, four matrices yield acceptable results:

• Level of importance of contracting R&D with the number of patents (Table 9)
• Level of importance of providing R&D with the number of IP mechanisms used (Table 10)
• Level of importance of providing R&D with the number of patents (Table 11)
• Proportion of employees assigned primarily in R&D with the number of IP mechanisms used (although it is a weak construct) (Table 12)

Table 7. MTMM matrix output example

                                      Web method                  Questionnaire method
Method                  Traits    RD1         IP1             RD2             IP2
Web method              RD1       –a          rRD1,IP1        rRD1,RD2        rRD1,IP2
                        IP1                   –a              rIP1,RD2        rIP1,IP2
Questionnaire method    RD2                                   –a              rRD2,IP2
                        IP2                                                   –a

Convergence validity: Yes if rRD1,RD2 and rIP1,IP2 are high and significant;

Discriminant validity: Yes if rRD1,RD2 and rIP1,IP2 are both(b) > |rRD1,IP2| and |rIP1,RD2|;

Construct validity without either a combination of trait and method effects or a mono-method bias: Yes if rRD1,RD2 and rIP1,IP2 > |rIP1,RD1| and |rRD2,IP2|.
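
The three criteria listed under Table 7 can be written as a small checking routine. The following Python sketch is illustrative only and is not the authors' implementation: the names `MTMMCell` and `mtmm_checks` are hypothetical, and because the text does not quantify what counts as a "high" correlation, convergence is reduced here to significance at the 5% level. The example values are those reported in Tables 8 and 9.

```python
from dataclasses import dataclass


@dataclass
class MTMMCell:
    """One correlation from the matrix and whether it is significant at 5%."""
    r: float
    significant: bool


def mtmm_checks(rd1_rd2, ip1_ip2, rd1_ip2, ip1_rd2, rd1_ip1, rd2_ip2):
    """Apply the three Table 7 criteria to a two-trait, two-method matrix."""
    # Mono-trait hetero-method values (the validity diagonal).
    validity = (rd1_rd2.r, ip1_ip2.r)
    # Convergence (assumption: "high" is not quantified in the text, so only
    # significance of both validity-diagonal values is checked).
    convergent = rd1_rd2.significant and ip1_ip2.significant
    # Discriminant: both validity values exceed the hetero-trait
    # hetero-method values in absolute terms.
    discriminant = min(validity) > max(abs(rd1_ip2.r), abs(ip1_rd2.r))
    # No trait-method combination or mono-method bias: both validity values
    # exceed the hetero-trait mono-method values in absolute terms.
    unbiased = min(validity) > max(abs(rd1_ip1.r), abs(rd2_ip2.r))
    return {"convergent": convergent, "discriminant": discriminant, "unbiased": unbiased}


# Values reported in Table 8 (contracting R&D with number of IP mechanisms).
table8 = mtmm_checks(
    MTMMCell(0.283, True), MTMMCell(0.368, True),
    MTMMCell(0.007, False), MTMMCell(0.209, False),
    MTMMCell(-0.18, False), MTMMCell(0.433, True),
)
# Values reported in Table 9 (contracting R&D with number of patents).
table9 = mtmm_checks(
    MTMMCell(0.283, True), MTMMCell(0.395, True),
    MTMMCell(0.044, False), MTMMCell(0.209, False),
    MTMMCell(-0.18, False), MTMMCell(-0.077, False),
)
print(table8)
print(table9)
```

Under these assumptions the Table 8 construct fails only the third criterion, while the Table 9 construct satisfies all three, which matches the verdicts given beneath those tables.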

Notes (common to Tables 8–13): * p < 0.05; ** p < 0.01

The rejection of the construct pairing the level of importance of contracting R&D with the number of IP mechanisms used (Table 8) suggests the presence of either a combination of trait and method effects or a mono-method bias, as the correlation between the two items from the questionnaire is stronger than the correlation obtained with their corresponding web-based measure. Thus, we cannot effectively discriminate the two web-based indicators from the questionnaire-based indicators, which implies that they cannot be part of the same construct.

Table 8. Level of importance of contracting R&D with the number of IP mechanisms used

                                         Web method          Questionnaire method
Method                 Traits            WEB_RD    WEB_IP    Contract_In_R&D    nbIP
Web method             WEB_RD            –a        −0.18     0.283*             0.007
                       WEB_IP                      –a        0.209              0.368**
Questionnaire method   Contract_In_R&D                       –a                 0.433**
                       nbIP                                                     –a

Convergence validity: Yes, rRD1,RD2 = 0.283* and rIP1,IP2 = 0.368** are high and significant;

Discriminant validity: Yes, rRD1,RD2 = 0.283* and rIP1,IP2 = 0.368** >> |rRD1,IP2| = 0.007 and |rIP1,RD2| = 0.209;

Construct validity without either a combination of trait and method effects or a mono-method bias: No, |rRD2,IP2| = 0.433** >> rRD1,RD2 = 0.283* and rIP1,IP2 = 0.368**.

Table 9. Level of importance of contracting R&D with the number of patents

                                         Web method          Questionnaire method
Method                 Traits            WEB_RD    WEB_IP    Contract_In_R&D    ln(nbPatent)
Web method             WEB_RD            –a        −0.18     0.283*             0.044
                       WEB_IP                      –a        0.209              0.395*
Questionnaire method   Contract_In_R&D                       –a                 −0.077
                       ln(nbPatent)                                             –a

Convergence validity: Yes, rRD1,RD2 = 0.283* and rIP1,IP2 = 0.395* are high and significant;

Discriminant validity: Yes, rRD1,RD2 = 0.283* and rIP1,IP2 = 0.395* >> |rRD1,IP2| = 0.044 and |rIP1,RD2| = 0.209;

Construct validity without either a combination of trait and method effects or a mono-method bias: Yes, rRD1,RD2 = 0.283* and rIP1,IP2 = 0.395* >> |rIP1,RD1| = 0.18 and |rRD2,IP2| = 0.077.

The rejection of the discriminant validity between the web-mining method and the questionnaire-based survey method with the proportion of employees assigned primarily in R&D with the number of patents (Table 13) stems from the fact that the correlation is higher with the IP web-based indicator than with the R&D web-based indicator. After all, it can be expected that a high proportion of employees dedicated to R&D activities should correlate strongly with the number of patents. Therefore, these indicators do not discriminate from each other, and this construct is thus rejected.

Finally, the proportion of employees assigned primarily in R&D with number of IP mechanisms
used passes the validity test because the four correlations are lower than the corresponding
values found in the validity diagonal, which is the accepting criterion even if |rIP1,RD2| = 0.288*

Table 10. Level of importance of providing R&D with the number of IP mechanisms used

                                          Web method          Questionnaire method
Method                 Traits             WEB_RD    WEB_IP    Contract_Out_R&D    nbIP
Web method             WEB_RD             –a        −0.18     0.306*              0.007
                       WEB_IP                       –a        0.068               0.368**
Questionnaire method   Contract_Out_R&D                       –a                  −0.066
                       nbIP                                                       –a

Convergence validity: Yes, rRD1,RD2 = 0.306* and rIP1,IP2 = 0.368** are high and significant;

Discriminant validity: Yes, rRD1,RD2 = 0.306* and rIP1,IP2 = 0.368** >> |rRD1,IP2| = 0.007 and |rIP1,RD2| = 0.068;

Construct validity without either a combination of trait and method effects or a mono-method bias: Yes, rRD1,RD2 = 0.306* and rIP1,IP2 = 0.368** >> |rIP1,RD1| = 0.18 and |rRD2,IP2| = 0.066.

Table 11. Level of importance of providing R&D with the number of patents

                                          Web method          Questionnaire method
Method                 Traits             WEB_RD    WEB_IP    Contract_Out_R&D    ln(nbPatent)
Web method             WEB_RD             –a        −0.18     0.306**             0.044
                       WEB_IP                       –a        0.068               0.396*
Questionnaire method   Contract_Out_R&D                       –a                  −0.134
                       ln(nbPatent)                                               –a

Convergence validity: Yes, rRD1,RD2 = 0.306** and rIP1,IP2 = 0.396* are high and significant;

Discriminant validity: Yes, rRD1,RD2 = 0.306** and rIP1,IP2 = 0.396* >> |rRD1,IP2| = 0.044 and |rIP1,RD2| = 0.068;

Construct validity without either a combination of trait and method effects or a mono-method bias: Yes, rRD1,RD2 = 0.306** and rIP1,IP2 = 0.396* >> |rIP1,RD1| = 0.18 and |rRD2,IP2| = 0.134.

and |rRD2,IP2| = 0.284* are reasonably high and significant. However, it is determined to be a weak
construct because of the slight differences in correlation.

Although these results are encouraging, the presence of four valid constructs ironically shows that we cannot isolate precise actions with our web-based indicators (see the results summary in Table 14). In other words, the “real” meaning of these indicators remains unknown and, as a result, proposition 1 is not confirmed. However, these measures may need to be put into a broader perspective to be effective, which may provide a better fit for the scope allowed by proposition 2. The next steps in this paper test whether the web-based indicators represent factors that are considered important for the firms.

Table 12. Proportion of employees assigned primarily in R&D with the number of IP mechanisms used

                                     Web method          Questionnaire method
Method                 Traits        WEB_RD    WEB_IP    propEmp_R&D    nbIP
Web method             WEB_RD        –a        −0.18     0.306**        0.007
                       WEB_IP                  –a        0.288*         0.368**
Questionnaire method   propEmp_R&D                       –a             0.284*
                       nbIP                                             –a

Convergence validity: Yes, rRD1,RD2 = 0.306** and rIP1,IP2 = 0.368** are high and significant;

Discriminant validity: Weak yes, rRD1,RD2 = 0.306** >> |rRD1,IP2| = 0.007, but rRD1,RD2 = 0.306** > |rIP1,RD2| = 0.288*, which is also high and significant; finally, rIP1,IP2 = 0.368** >> |rIP1,RD2| = 0.288* and |rRD1,IP2| = 0.007;

Table 13. Proportion of employees assigned primarily in R&D with the number of patents

                                      Web method          Questionnaire method
Method                 Traits         WEB_RD    WEB_IP    propEmp_R&D    ln(nbPatent)
Web method             WEB_RD         –a        −0.18     0.306**        0.044
                       WEB_IP                   –a        0.455**        0.396*
Questionnaire method   propEmp_R&D                        –a             0.02
                       ln(nbPatent)                                      –a

Convergence validity: Yes, rRD1,RD2 = 0.306** and rIP1,IP2 = 0.396* are high and significant;

4.3. Testing Proposition 2

Construction of formative indices and new validation construct

Given the vast number of words used to construct the web-based indicators, as we concluded in
the previous section, treating each questionnaire-based variable individually may not be appropriate.
To illustrate the large lexical field of possible words related to the factors studied, and to
properly test proposition 2, it is conceptually sound to build one single measure: one formative

Table 14. Summary of construct validity results

Table   Construct                                                                               Convergent   Discriminant   Construct validity without either a combination of
                                                                                                validity     validity       trait and method effects or a mono-method bias
8       Level of importance of contracting R&D with the number of IP mechanisms used           Yes          Yes            No
9       Level of importance of contracting R&D with the number of patents                      Yes          Yes            Yes
10      Level of importance of providing R&D with the number of IP mechanisms used             Yes          Yes            Yes
11      Level of importance of providing R&D with the number of patents                        Yes          Yes            Yes
12      Prop. of employees assigned primarily in R&D with the number of IP mechanisms used     Yes          Weak yes       Yes
13      Prop. of employees assigned primarily in R&D with the number of patents                Yes          No

index with all the questions related to R&D and IP. Because the PCA performed on all the items
related to R&D and to IP presented at the end of section 3.1 did not produce any significant KMO
and Cronbach’s alpha measures, the use of a formative index comprising several subelements
explaining our R&D factor in its broadest sense may be more appropriate.

Partial least squares (PLS) regressions were estimated to determine whether it is possible to
create valid formative indices for R&D and IP. PLS regressions require complete data sets
(Nelson, Taylor, & MacGregor, 1996). Nonresponse is usually
treated by either weight adjustment (i.e., deleting incomplete data entries and weighting the remaining
respondents to compensate for the deletion) or imputation (i.e., adding artificial values based on
averages by class and editing methods (Särndal, Swensson, & Wretman, 1992) to replace the
missing values) (Haziza & Beaumont, 2007). As our sample size for IP is already low for one of
the items (39 for the number of patents), we could not afford to treat the missing data with a weight
adjustment. Accordingly, we replaced the missing data using imputation classes based on
control variables. We sorted the firms by sector, then by number of employees, and then by
revenue. Depending on the situation, we used the mean of the class or the most conservative
nearest neighbor, a method commonly used in the literature (Haziza & Beaumont, 2007;
Little, 1986; Thomsen, 1973).
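
To make the class-based imputation concrete, here is a minimal Python sketch of the class-mean variant only. It is an illustration under stated assumptions, not the procedure actually coded for the study: the function name `impute_class_mean`, the record layout, and the sample firms are all hypothetical, imputation classes are reduced to a single `sector` key, and the nearest-neighbor alternative is not shown.

```python
from collections import defaultdict
from statistics import mean


def impute_class_mean(records, class_key, field):
    """Return copies of `records` where missing (None) values of `field`
    are replaced by the mean of the observed values in the same class."""
    observed = defaultdict(list)
    for rec in records:
        if rec[field] is not None:
            observed[rec[class_key]].append(rec[field])

    imputed = []
    for rec in records:
        value = rec[field]
        # Only impute when the class has at least one observed value.
        if value is None and observed[rec[class_key]]:
            value = mean(observed[rec[class_key]])
        imputed.append({**rec, field: value})
    return imputed


# Hypothetical firms; the sector labels and patent counts are made up.
firms = [
    {"firm": "A", "sector": "nano", "nbPatent": 2},
    {"firm": "B", "sector": "nano", "nbPatent": None},
    {"firm": "C", "sector": "nano", "nbPatent": 4},
    {"firm": "D", "sector": "materials", "nbPatent": 10},
]
completed = impute_class_mean(firms, class_key="sector", field="nbPatent")
print(completed[1]["nbPatent"])
```

Firm B receives the mean of the two observed values in its class, and records that were already complete are left unchanged.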

Because not all the items shared the same scale, we transformed each variable into a Z-score.
PLS regressions were then estimated using WarpPLS 5.0 software with the following settings:
MODEL B BASIC Warp3 Stable 3 and MODEL B BASIC Linear Stable 3. The two different settings
produced the same conclusions. The details of the construct comparing the web mining technique
and the questionnaire-based survey are shown in Table 15 (the PLS regressions and the
resulting weights used to build these indices are provided in Appendix 4).
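
The standardization step and the construction of a formative index as a weighted sum of standardized items can be sketched as follows. This is a hedged illustration, not the WarpPLS procedure: the weights below are hypothetical placeholders (the estimated weights are in Appendix 4), the item values are made up, and the sample standard deviation is assumed for the Z-score.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize a list of values (assumption: sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def formative_index(standardized_items, weights):
    """Weighted sum of standardized items, computed firm by firm."""
    n_firms = len(next(iter(standardized_items.values())))
    return [
        sum(weights[name] * standardized_items[name][i] for name in weights)
        for i in range(n_firms)
    ]


# Made-up raw values for three firms.
raw = {"nbIP": [1, 2, 3], "nbPatent": [0, 5, 10]}
standardized = {f"Z_{name}": z_scores(vals) for name, vals in raw.items()}

# Hypothetical weights, for illustration only.
weights = {"Z_nbIP": 0.6, "Z_nbPatent": 0.5}
ip_index = formative_index(standardized, weights)
print(ip_index)
```

Standardizing first puts items measured on different scales on a common footing, so that each weight reflects the item's contribution rather than its unit of measurement.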

All weights are significant (p < 0.01); indicator weight-loading signs are all positive; variance inflation factors (VIF) are all very low (< 1.5); and the effect sizes (ES) are all greater than 0.02. All the criteria are met to indicate that the indices generated are valid (Cenfetelli & Bassellier, 2009; Cohen, 1988; Diamantopoulos, 1999; Diamantopoulos & Siguaw, 2006; Diamantopoulos & Winklhofer, 2001; Petter, Straub, & Rai, 2007). For each factor, the sum of the weighted variables generated the corresponding indicator, R&D_INDEX or IP_INDEX.

Table 15. The validation construct for proposition 2 testing

Concepts   Web mining variables   PLS-built indices   Questionnaire variables
R&D        ln(WEB_R&D)            R&D_INDEX           Z_nb_R&D_proj
                                                      Z_Int_R&D_source
                                                      Z_Ext_R&D_source
                                                      Z_Contract_In_R&D
                                                      Z_Contract_Out_R&D
                                                      Z_time_R&D
                                                      Z_propEmp_R&D
IP         Inv(WEB_IP)            IP_INDEX            Z_nbIP
                                                      Z_nbPatent

Note: All variables are continuous and normal.

Final MTMM analysis

This final MTMM matrix includes the two web-based indicators along with the R&D_INDEX and IP_INDEX (see Table 16). Once again, the reliability diagonal will be neglected in our analysis, as the measures are made with single items from the web method and with formative indices from our questionnaire. The mono-trait hetero-method diagonal shows high and significant correlations for R&D (r = 0.419; p < 0.01) and for IP (r = 0.520; p < 0.01), which hints at strong convergent validity. The hetero-trait mono-method value is low and not significant for the web content analysis method (r = −0.182; p > 0.05), although the questionnaire-based survey method value is high and significant (r = 0.320; p < 0.01). However, the mono-trait hetero-method values (r = 0.419 and r = 0.520) for R&D and IP respectively are much higher than the absolute hetero-trait mono-method values (r = 0.320 and r = 0.182), which indicates no combination of trait and method effects and no mono-method biases. The first hetero-trait hetero-method value is low and not significant (r = −0.017; p > 0.05), and the other is moderate and significant (r = 0.294; p < 0.05). However, and more importantly, the correlations are much lower than the corresponding values found in the validity diagonal, which shows good discriminant validity. As all the conditions are satisfied under the original guidelines proposed by Campbell and Fiske (1959), no risk of potential biases is induced within the methods, the traits, or a combination of both. The results based on this methodology suggest that our web-based indicators reflect the importance given to innovation factors such as R&D and IP. It is also worth mentioning that the validity of this construct appears to be stronger than all the others attempted in section 4.2.

Table 16. MTMM matrix for R&D_INDEX and IP_INDEX

                                   Web method          Questionnaire method
Method                 Traits      WEB_RD    WEB_IP    R&D_INDEX    IP_INDEX
Web method             WEB_RD      –a        −0.182    0.419**      −0.017
                       WEB_IP                –a        0.294**      0.520**
Questionnaire method   R&D_INDEX                       –a           0.320**
                       IP_INDEX                                     –a

Convergence validity: Yes, rRD1,RD2 = 0.419** and rIP1,IP2 = 0.520** are really high and significant;

Discriminant validity: Yes, rRD1,RD2 = 0.419** and rIP1,IP2 = 0.520** >> |rRD1,IP2| = 0.017 and |rIP1,RD2| = 0.294**;

Construct validity without either a combination of trait and method effects or a mono-method bias: Yes, rRD1,RD2 = 0.419** and rIP1,IP2 = 0.520** >> |rIP1,RD1| = 0.182 and |rRD2,IP2| = 0.320**.

Notes: a All traits are measured by single items; no reliability statistic can be calculated.

* p < 0.05; ** p < 0.01

4.4. Post Hoc Analysis: MTMM Confirmatory Factor Analysis

Some method effects can be induced independently “among the items within or between constructs” (Ortiz de Guinea et al., 2013, p. 839) and “based on the nature of the rater, item, construct, and/or context” (Richardson, Simmering, & Sturman, 2009, p. 766). This could mean that one or more traits do not contribute or contribute negatively to the construct, which could ultimately affect the correlations found in the MTMM matrix performed in the previous step. Thus, if these method effects are significant, a congeneric bias is present.

Figure 2. MTMM CFA with formative indexes.

Table 17. MTMM CFA results

Items                                                   Trait loading   Method loading
Questionnaire-based survey items measuring R&D
  R&D1 (Z_propEmp_R&D)                                  0.825           0.793
  R&D2 (Z_nb_R&D_proj)                                  0.921           0.984
  R&D3 (Z_time_R&D)                                     0.848           0.910
  R&D4 (Z_Ext_R&D_source)                               1.000           0.984
  R&D5 (Z_Contract_Out_R&D)                             0.983           0.469
  R&D6 (Z_Contract_In_R&D)                              0.914           0.905
Questionnaire-based survey items measuring IP
  IP1 (Z_nbIP)                                          0.900           0.993
  IP2 (Z_nbPatent)                                      1.000           0.978
Web mining item measuring R&D
  WEB_R&D (Z_WEB_R&D)                                   0.996           0.915
Web mining item measuring IP
  IP_WEB (Z_WEB_IP)                                     0.947           0.923

One way to verify the effect of each item individually is to extend the MTMM matrix with a CFA, which allows the estimation of latent factors within the MTMM matrix (Maas, Lensvelt-Mulders, & Hox, 2009). Using two PLS regressions allows traits and method factors to load on the items of the estimated constructs. Convergent validity is obtained when the trait loadings are higher than the method loadings. A possible method effect might thus be induced if a method loading is higher than the related trait loading. The discriminant validity is obtained when correlations between the traits are low or moderate. The model tested for this analysis is presented in Figure 2. MTMM CFA results are presented in Table 17. All the traits load on every item with strong values (all above 0.8). Most of the value loadings for the methods are strong, but generally lower
than the trait loading, indicating convergent validity and no congeneric bias. Possible method effects are observed with three items (R&D2, R&D3, and IP1), which load higher with the method than with the trait. It is possible that our use of formative indices may have had an influence on the measure, thus creating a method bias flagged by the CFA results. However, 7/10 (70%) of our results respect the condition, which is considered a high enough proportion to suggest an overall absence of method bias in the construct (Ortiz de Guinea et al., 2013). Finally, comparing both traits (as set out in Table 18) shows a moderate correlation between the two formative indices, which indicates good discriminant validity. The results from this post hoc analysis confirm the results obtained using the MTMM matrix alone. Our web-based indicators seem to effectively reflect the importance that technological firms grant to R&D and IP.

Table 18. MTMM CFA correlations

             R&D_INDEX   IP_INDEX
R&D_INDEX    1.000
IP_INDEX     0.273*      1.000

* p < 0.05. ** p < 0.01.

5. DISCUSSION AND IMPLICATIONS FOR RESEARCH

We used web content analysis to build innovation indicators from the web text content of four factors that are consequential for the success of high-technology innovation: R&D, IP, collaboration, and external financing. We then validated the true nature of these indicators by determining whether they were valid substitutes for specific indicators that would otherwise have required a questionnaire-based survey to be obtained. To better understand the nature of the data extracted, two propositions were extensively tested.

Our first proposition stipulated that our web-based indicators could effectively measure specific actions of a company to perform activities related to a given factor.
Although significant results were obtained for IP factors and some indicators of R&D, there were no significant correlations with either collaboration or external financing factors.

For the specific case of R&D, we observed that the web-based indicator seems to reflect the promotion needs of various firms in terms of R&D. Our R&D web-based indicator is most highly correlated with the survey-based indicator related to firms more likely to provide R&D services to third parties. Thus, companies use their websites to promote their R&D service offerings. Furthermore, our web-based R&D indicator is highly correlated with the questionnaire-based indicator associated with firms that have a high percentage of employees allocated to R&D tasks. This can be explained by the willingness of firms to attract new R&D talent via their websites. Finally, our R&D web-based indicator correlated significantly with the questionnaire-based indicator tied to whether firms are more likely to contract R&D services from external providers. It is possible that the more important a firm considers R&D, the more likely it is to use both internal and external resources to achieve its R&D-related objectives, which can be described as open innovation practices (Chesbrough, 2003). However, we did not find any correlation between our R&D web-based indicator and an important R&D indicator, the number of R&D projects. This may seem counterintuitive at first, but it reflects a limitation of the web-based indicator, which does not properly account for the size dimension of R&D activities.

In contrast, our IP web-based indicator correlates with both the use of IP mechanisms and patent-related activities.
Prior to the protection of an intellectual asset, that is, up to the R&D project level, an idea or future invention remains secret. Once an invention is revealed to the world via the disclosure that is generally associated with formal IP protection mechanisms, these IP protection mechanisms act as a signal for the firm, showing how innovative it is or what it wants the world to see. Given that firms’ websites are used to publicly disclose such information, they appear to be a promising source of IP protection-related data.

However promising, these preliminary results required more rigorous testing, beyond the simple correlation test between our indicators. The goal was to identify whether specific variables stand out from the others in terms of validity. To do so, we used MTMM matrices to verify the validity of all the promising constructs to assess whether our questionnaire-based indicators can effectively discriminate from each other when compared to the web-based indicators. Of the six MTMM matrices produced, four passed all the required validity tests, and two failed.

The fact that we obtained four valid constructs with four different combinations of R&D-IP pairs of different variables demonstrates that we cannot isolate specific actions from the questionnaire-based survey from web-based indicators at this point. Accordingly, at this stage we cannot use the web-based indicator built as a direct substitute for action-related indicators, as good convergent and discriminant validity were obtained for three different questionnaire-based indicators with the R&D web-based indicator and two different questionnaire-based indicators with the IP web-based indicator.

The keywords mentioned on websites may be used by companies that contract R&D services offered by third parties, by companies that provide R&D services to other companies, or by companies that have a higher proportion of employees allocated to R&D, or any combination of the three.
We cannot separate one action from another. The same applies to IP protection. Because acceptable results have been found for the number of patents that a firm owns and for the total number of mechanisms used to protect IP, it is impossible to know precisely what is being measured by our web-based indicator. It is worth mentioning that neither factor significantly correlates with the other (r = 0.144; p > 0.1), which
suggests that both variables represent different traits. Therefore, it is not possible at this point
to isolate precise IP-related actions.

The second proposition suggests that our web-based indicators could measure the degree of
importance that a company grants to a given factor. To do so, we combined all the elements
related to a given factor to create two formative indices from the questionnaire-based indicators:
R&D_INDEX and IP_INDEX. We repeated the MTMM analysis using these two indicators
and the two web-based indicators and observed the convergent and discriminant validity of
our construct. A post hoc analysis with a CFA confirmed our results. Therefore, our proposition
2 is accepted. In other words, we can use our web-based indicators as proxies for the relative
importance that a company grants to a given factor (such as R&D and IP). It is possible to
conclude that the more R&D-related keywords a firm uses, the more important it considers
R&D-related activities to be, irrespective of the nature of the activities. The same also applies
to the protection of IP. We know that the more IP protection-related terms are used, the more
actions are taken in this direction. However, this methodology does not allow us to precisely
determine the nature of the actions a firm takes, which echoes previous findings by Gök et al.
(2015). We can only guess the nature of these actions, as the indicator gives positive answers
in both cases because the use of keywords alone ignores the context in which these words
are used.
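
The context problem can be illustrated with a toy keyword counter. The Python sketch below is not the crawler or the keyword list used in this study: the function `keyword_count`, the two sentences, and the keyword set are hypothetical and chosen only to show that a raw frequency count returns the same score for opposite statements.

```python
import re
from collections import Counter


def keyword_count(text, keywords):
    """Count occurrences of any keyword in `text`, ignoring case and context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return sum(counts[k] for k in keywords)


# Hypothetical keyword list and website snippets.
ip_keywords = {"patent", "patents"}
site_a = "We hold three patents and filed a new patent."
site_b = "We do not patent our processes; no patents are held."

score_a = keyword_count(site_a, ip_keywords)
score_b = keyword_count(site_b, ip_keywords)
print(score_a, score_b)
```

Both snippets obtain the same count even though only the first describes patenting activity, which is why a frequency-based indicator can signal that a topic matters to a firm without revealing what the firm actually does.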

In a nutshell, our methodology can be used as a valid approach to provide data for future innovation
and technology management studies on the relative importance given to factors such as
R&D and IP, and to test the validity of the measures thus created. In most questionnaire-based
surveys, this information is gathered using 1 to 7 Likert scale questions. If the goal of a study is to
determine the degree of importance of core factors such as R&D or IP for a firm, the use of web-based
indicators is reasonable. However, if the goal is to gather more precise information, such as
the specific actions taken by a firm, these web-based indicators lack the necessary context to
behave as expected. Although they may provide complementary information, they cannot be
used as direct proxies.

The fact that we did not obtain significant correlations with collaboration and external funding
suggests erring on the side of caution before using this method on a larger scale. The reasons for
the absence of convergent validity of the web-based indicators towards those generated from the
questionnaire-based survey can be attributed to the small sample size, the inherent characteristics
of the subjects from our sample, the questions used to capture that information within the
questionnaire-based survey, and the keywords used to capture the information from the websites.
Thus, other tests with different combinations of methods, keywords, and traits need to be explored
to determine whether it is possible to obtain valid and relevant information from companies’
websites.

6. LIMITATIONS AND FUTURE RESEARCH

Given more data, our research would be more robust, especially in terms of verifying
whether the concepts of collaboration and external financing, normally addressed by classical methods,
can be appropriately measured using websites. For example, we were unable to crawl data
from all the companies from our survey due to technical limitations, which meant that only 79 out
of 89 companies were used in this paper.

Websites are updated from time to time and the information provided changes accordingly,
depending on what companies want to make public. It is thus important to note that a single web
mining crawl might be insufficient to capture all relevant information, given that results are
subject to change as websites are updated. Thus, a longitudinal study would be required to more
clearly understand how time can influence the traits to be measured; that is, the content available
on a website and the nature of the website itself. An analysis of the evolution of web-based indicators
over time would provide further validity to our methodology.

The inability to measure specific actions constitutes a limitation in our web mining methodology.
The main problem lies in the lack of context tied to the use of keywords alone, possibly
leading to multiple false positives. Machine learning and deep learning techniques, such as
recurrent neural networks, natural language processing, or bag-of-words models, are promising
avenues to explore the necessary context surrounding specific concepts to improve the level
of precision of web-based indicators.

Furthermore, we began with theoretical factors for the conceptual framework, then identified
the keywords related to these factors, and finally mined the website for these specific keywords.
An interesting alternative would be to do this the other way around; in other words, to start with
the website content and identify the factors that can be naturally found via unsupervised machine
learning algorithms. The term frequency inverse document frequency technique (TF-IDF) could
be used to provide insight into the importance of keywords relative to the rest of a document.
N-gram low-frequency words clustering could also be tested to better isolate specific areas of
interest (Price & Thelwall, 2005).
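
As a rough illustration of the TF-IDF idea, the following Python sketch scores each term of each document by its relative frequency in the document multiplied by the logarithm of the inverse document frequency. It shows one common variant of the weighting, assumed here for simplicity; the function name `tf_idf` and the three short "website" texts are hypothetical and were not part of this study.

```python
import math
import re
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical website texts.
sites = [
    "nanotechnology research and patents",
    "research services and coatings",
    "research funding",
]
scores = tf_idf(sites)
print(scores[0])
```

A term that appears on every website, such as "research" in this toy corpus, receives a score of zero, whereas a term specific to one website, such as "patents", is weighted up. This is the sense in which TF-IDF expresses the importance of a keyword relative to the rest of the collection.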

Company websites are purposely structured in a cooperative and agreeable manner for anyone
seeking information about products, services, activities, etc. The self-reporting bias
induced by this methodology is inevitable. However, it is important to note that questionnaire-based
surveys and most national official public directories are all also subject to self-reporting
biases. Fortunately, the bias induced by the web mining technique is as much a quality as it is
a flaw, in that it provides insight on how a company wishes to be perceived. Indeed, companies
post what they care about, what is important to them, and who they are as an organization on their
websites. This qualitative information represents the essence of the company. Future research is
needed to determine whether such information could, for example, be used as a proxy to understand
a company’s culture.

Other qualitative data analyses of the websites’ content could be used to reduce the risk of
false positives and to gather more accurate data. The use of indicators based on websites’ text
data will open the door to experimenting with other possible indicators derived from the same
websites, such as the use of colors, pictures, and illustrations; audio and video content; choices of
web design styles; use of modern web technologies; frequency of updates; possible interactions
with visitors through all of the possible calls to action; and many other sources of data we may not
have thought of as yet.

Finding a way to leverage the information from high-tech companies' websites will give
innovation management researchers access to a whole new source of data that is free to
use, accessible at all times, and available in large quantities, which will ultimately
facilitate studies in this field. Finally, the innovation research community is invited to build
on this common ground to create a new systematic way to validate new constructs. If such new
innovation indicators from web-based sources can be validated and clearly understood for what
they truly measure, the burden on firms that are increasingly asked to answer questionnaires
(from researchers, industry associations, or the government) would be considerably reduced.
This paper shows that this is a promising avenue.

AUTHOR CONTRIBUTIONS

Mikaël Héroux-Vaillancourt: Conceptualization, Data curation, Formal analysis, Investigation,
Methodology, Project administration, Resources, Software, Validation, Visualization,
Writing – original draft, Writing – review & editing. Catherine Beaudry: Conceptualization,
Methodology, Project administration, Resources, Funding acquisition, Supervision, Validation,
Visualization, Writing – original draft, Writing – review & editing. Constant Rietsch:
Resources, Software, Visualization, Writing – original draft.

COMPETING INTERESTS

The authors have no competing interests.

FUNDING INFORMATION

This research project was supported by Social Sciences and Humanities Research Council grants
(numbers #435-2013-1220 and #895-2018-1006) and the Canada Research Chair program.

DATA AVAILABILITY

Data cannot be made publicly available due to the confidentiality contract with the subjects of
this study.

REFERENCES

Adams, R., Bessant, J., & Phelps, R. (2006). Innovation management measurement: A review. International Journal of Management Reviews, 8(1), 21–47. DOI: https://doi.org/10.1111/j.1468-2370.2006.00119.x

Almind, T. C., & Ingwersen, P. (1997). Informetric analyses on the world wide web: Methodological approaches to “webometrics.” Journal of Documentation, 53(4), 404–426. DOI: https://doi.org/10.1108/EUM0000000007205

Archibugi, D. (1992). Patenting as an indicator of technological innovation: A review. Science and Public Policy, 19(6), 357–368. DOI: https://doi.org/10.1093/spp/19.6.357

Archibugi, D., & Sirilli, G. (2001). The direct measurement of technological innovation in business. In Innovation and enterprise creation: Statistics and indicators. Proceedings of the Conference Held at Sophia Antipolis. European Commission (Eurostat), ed. Luxembourg: European Commission.

Armellini, F., Beaudry, C., & Kaminski, P. C. (2017). Open within a box: An analysis of open innovation patterns within Canadian aerospace companies. Sinergie Italian Journal of Management, 34(101), 15–36. DOI: https://doi.org/10.7433/s101.2016.02

Armellini, F., Kaminski, P. C., & Beaudry, C. (2014). The open innovation journey in emerging economies: An analysis of the Brazilian aerospace industry. Journal of Aerospace Technology and Management, 6(4), 462–474. DOI: https://doi.org/10.5028/jatm.v6i4.390

Arora, A. (1995). Licensing tacit knowledge: Intellectual property rights and the market for know-how. Economics of Innovation and New Technology, 4(1), 41–60. DOI: https://doi.org/10.1080/10438599500000013

Arora, S. K., Youtie, J., Shapira, P., Gao, L., & Ma, T. (2013). Entry strategies in an emerging technology: A pilot web-based study of graphene firms. Scientometrics, 95(3), 1189–1207. DOI: https://doi.org/10.1007/s11192-013-0950-7

Arvanitis, S. (2012). How do different motives for R&D cooperation affect firm performance? – An analysis based on Swiss micro data. Journal of Evolutionary Economics, 22(5), 981–1007. DOI: https://doi.org/10.1007/s00191-012-0273-5

Bagozzi, R. P., Yi, Y., & Phillips, L. W. (1991). Assessing construct validity in organizational research. Administrative Science Quarterly, 36(3), 421–458. DOI: https://doi.org/10.2307/2393203

Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264–1272. DOI: https://doi.org/10.1037/xge0000383, PMID: 30070579

Baysinger, B., & Hoskisson, R. E. (1989). Diversification strategy and R&D intensity in multiproduct firms. Academy of Management Journal, 32(2), 310–332. DOI: https://doi.org/10.2307/256364

Becheikh, N., Landry, R., & Amara, N. (2006). Lessons from innovation empirical studies in the manufacturing sector: A systematic review of the literature from 1993–2003. Technovation, 26(5), 644–664. DOI: https://doi.org/10.1016/j.technovation.2005.06.016

Belderbos, R., Carree, M., & Lokshin, B. (2004). Cooperative R&D and firm performance. Research Policy, 33(10), 1477–1492. DOI: https://doi.org/10.1016/j.respol.2004.07.003

Björneborn, L., & Ingwersen, P. (2004). Toward a basic framework for webometrics. Journal of the American Society for Information Science and Technology, 55(14), 1216–1227. DOI: https://doi.org/10.1002/asi.20077

Bozdogan, K., Deyst, J., Hoult, D., & Lucas, M. (1998). Architectural innovation in product development through early supplier integration. R&D Management, 28(3), 163–173. DOI: https://doi.org/10.1111/1467-9310.00093

Brown, J. R., Fazzari, S. M., & Petersen, B. C. (2009). Financing innovation and growth: Cash flow, external equity, and the 1990s R&D boom. Journal of Finance, 64(1), 151–185. DOI: https://doi.org/10.1111/j.1540-6261.2008.01431.x

Campbell, C. M., Michel, J. O., Patel, S., & Gelashvili, M. (2019). College teaching from multiple angles: A multi-trait multi-method analysis of college courses. Research in Higher Education, 60(5), 711–735. DOI: https://doi.org/10.1007/s11162-018-9529-8

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. DOI: https://doi.org/10.1037/h0046016, PMID: 13634291

Carboni, O. A. (2013). Spatial and industry proximity in collaborative research: Evidence from Italian manufacturing firms. Journal of Technology Transfer, 38(6), 896–910. DOI: https://doi.org/10.1007/s10961-012-9279-2

Cebon, P., Newton, P., & Noble, P. (1999). Innovation in firms: Towards a framework for indicator development. Melbourne Business School, Working Paper, 99-9.

Cenfetelli, R. T., & Bassellier, G. (2009). Interpretation of formative measurement in information systems research. MIS Quarterly, 33(4), 689–707. DOI: https://doi.org/10.2307/20650323

Chesbrough, H. W. (2003). Open innovation: The new imperative for creating and profiting from technology. Brighton, MA: Harvard Business School Press.

Choi, H., & Varian, H. (2012). Predicting the present with Google Trends. Economic Record, 88(s1), 2–9. DOI: https://doi.org/10.1111/j.1475-4932.2012.00809.x

Churchill, G. A. (1979). A paradigm for developing better measures of marketing constructs. Journal of Marketing Research, 16(1), 64–73. DOI: https://doi.org/10.2307/3150876

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: L. Erlbaum Associates.

Cohen, W. M., & Levin, R. C. (1989). Empirical studies of innovation and market structure. In Handbook of Industrial Organization (Vol. 2, pp. 1059–1107). Amsterdam: Elsevier. DOI: https://doi.org/10.1016/S1573-448X(89)02006-6

Cohen, W. M., & Levinthal, D. A. (1990). Absorptive capacity: A new perspective on learning and innovation. Administrative Science Quarterly, 35(1), 128–152. DOI: https://doi.org/10.2307/2393553

Coombs, R., Narandren, P., & Richards, A. (1996). A literature-based innovation output indicator. Research Policy, 25(3), 403–413. DOI: https://doi.org/10.1016/0048-7333(95)00842-X

Crawley, T. (2007). Commercialization of nanotechnology–Key challenges. Report on the Workshop Organised by Nanoforum, Helsinki.

Deeds, D. L. (2001). The role of R&D intensity, technical development and absorptive capacity in creating entrepreneurial wealth in high technology start-ups. Journal of Engineering and Technology Management, 18(1), 29–47. DOI: https://doi.org/10.1016/S0923-4748(00)00032-1

Diamantopoulos, A. (1999). Viewpoint – Export performance measurement: Reflective versus formative indicators. International Marketing Review, 16(6), 444–457. DOI: https://doi.org/10.1108/02651339910300422

Diamantopoulos, A., & Siguaw, J. A. (2006). Formative versus reflective indicators in organizational measure development: A comparison and empirical illustration. British Journal of Management, 17(4), 263–282. DOI: https://doi.org/10.1111/j.1467-8551.2006.00500.x

Diamantopoulos, A., & Winklhofer, H. M. (2001). Index construction with formative indicators: An alternative to scale development. Journal of Marketing Research, 38(2), 269–277. DOI: https://doi.org/10.1509/jmkr.38.2.269.18845

Dosi, G. (1988). Sources, procedures, and microeconomic effects of innovation. Journal of Economic Literature, 26(3), 1120–1171.

Esposito, E. (2004). Strategic alliances and internationalisation in the aircraft manufacturing industry. Technological Forecasting and Social Change, 71(5), 443–468. DOI: https://doi.org/10.1016/S0040-1625(03)00002-7

Feldman, M. P., & Florida, R. (1994). The geographic sources of innovation: Technological infrastructure and product innovation in the United States. Annals of the Association of American Geographers, 84(2), 210–229. DOI: https://doi.org/10.1111/j.1467-8306.1994.tb01735.x

Fiske, D. W., & Campbell, D. T. (1992). Citations do not solve problems. Psychological Bulletin, 112(3), 393. DOI: https://doi.org/10.1037/0033-2909.112.3.393

Flor, M. L., & Oltra, M. J. (2004). Identification of innovating firms through technological innovation indicators: An application to the Spanish ceramic tile industry. Research Policy, 33(2), 323–336. DOI: https://doi.org/10.1016/j.respol.2003.09.009

Fosfuri, A. (2006). The licensing dilemma: Understanding the determinants of the rate of technology licensing. Strategic Management Journal, 27(12), 1141–1158. DOI: https://doi.org/10.1002/smj.562

Frear, C. R., & Metcalf, L. E. (1995). Strategic alliances and technology networks: A study of a cast-products supplier in the aircraft industry. Industrial Marketing Management, 24(5), 379–390. DOI: https://doi.org/10.1016/0019-8501(95)00029-A

Geroski, P., Machin, S., & Van Reenen, J. (1993). The profitability of innovating firms. RAND Journal of Economics, 24(2), 198–211. DOI: https://doi.org/10.2307/2555757

Gök, A., Waterworth, A., & Shapira, P. (2015). Use of web mining in studying innovation. Scientometrics, 102(1), 653–671. DOI: https://doi.org/10.1007/s11192-014-1434-0, PMID: 26696691, PMCID: PMC4677352

Greve, H. R. (2003). A behavioral theory of R&D expenditures and innovations: Evidence from shipbuilding. Academy of Management Journal, 46(6), 685–702. DOI: https://doi.org/10.2307/30040661

Griliches, Z. (1990). Patent Statistics as Economic Indicators: A Survey (Working Paper No. 3301). National Bureau of Economic Research. DOI: https://doi.org/10.3386/w3301

Griliches, Z. (1994). Productivity, R&D and the data constraint. Presidential address, American Economic Association, Boston, January 4, 1994. American Economic Review, 84(1), 115–119.

Griliches, Z. (1998). R&D and productivity. Chicago, IL: University of Chicago Press. https://ideas.repec.org/b/ucp/bknber/9780226308869.html, DOI: https://doi.org/10.7208/chicago/9780226308906.001.0001

Gulek, C. (1999). Using multiple means of inquiry to gain insight into classrooms: A multi-trait multi-method approach. https://eric.ed.gov/?id=ED431016

Guo, B., Aveyard, P., Fielding, A., & Sutton, S. (2008). Testing the convergent and discriminant validity of the decisional balance scale of the transtheoretical model using the multi-trait multi-method approach. Psychology of Addictive Behaviors, 22(2), 288–294. DOI: https://doi.org/10.1037/0893-164X.22.2.288, PMID: 18540726

Hagedoorn, J., & Cloodt, M. (2003). Measuring innovative performance: Is there an advantage in using multiple indicators? Research Policy, 32(8), 1365–1379. DOI: https://doi.org/10.1016/S0048-7333(02)00137-3

Hagedoorn, J., Link, A. N., & Vonortas, N. S. (2000). Research partnerships. Research Policy, 29(4), 567–586. DOI: https://doi.org/10.1016/S0048-7333(99)00090-6

Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (1998). Multivariate data analysis (Vol. 5). Upper Saddle River, NJ: Prentice Hall.

Hall, B. H. (1990). The impact of corporate restructuring on industrial research and development. Brookings, January 1. https://www.brookings.edu/bpea-articles/the-impact-of-corporate-restructuring-on-industrial-research-and-development/, DOI: https://doi.org/10.2307/2534781

Harhoff, D., & Körting, T. (1998). Lending relationships in Germany – Empirical evidence from survey data. Journal of Banking & Finance, 22(10), 1317–1353. DOI: https://doi.org/10.1016/S0378-4266(98)00061-2

Hausman, J. A., Hall, B. H., & Griliches, Z. (1984). Econometric Models for Count Data with an Application to the Patents-R&D Relationship (Working Paper No. 17). National Bureau of Economic Research. DOI: https://doi.org/10.3386/t0017

Haziza, D., & Beaumont, J.-F. (2007). On the construction of imputation classes in surveys. International Statistical Review, 75(1), 25–43. DOI: https://doi.org/10.1111/j.1751-5823.2006.00002.x

Herrouz, A., Khentout, C., & Djoudi, M. (2013). Overview of web content mining tools. ArXiv:1307.1024 [Cs]. http://arxiv.org/abs/1307.1024

Hitt, M. A., Hoskisson, R. E., & Kim, H. (1997). International diversification: Effects on innovation and firm performance in product-diversified firms. Academy of Management Journal, 40(4), 767–798. DOI: https://doi.org/10.2307/256948

Hwang, D. (2010). Ranking the nations on nanotech | Solid State Technology. http://electroiq.com/blog/2010/08/ranking-the-nations/

Hyun Kim, J. (2012). A hyperlink and semantic network analysis of the triple helix (University-Government-Industry): The interorganizational communication structure of nanotechnology. Journal of Computer-Mediated Communication, 17(2), 152–170. DOI: https://doi.org/10.1111/j.1083-6101.2011.01564.x

Johnson, W. H. A., & Filippini, R. (2009). Internal vs. external collaboration: What works. Research-Technology Management, 52(3), 15–17. DOI: https://doi.org/10.1080/08956308.2009.11657564

Jordan, J., & Lowe, J. (2004). Protecting strategic knowledge: Insights from collaborative agreements in the aerospace sector. Technology Analysis & Strategic Management, 16(2), 241–259. DOI: https://doi.org/10.1080/09537320410001682900

Kalil, T. A. (2005). Nanotechnology and the valley of death. Nanotechnology Law & Business, 2, 265.

Katz, J. S., & Cothey, V. (2006). Web indicators for complex innovation systems. Research Evaluation, 15(2), 85–95. DOI: https://doi.org/10.3152/147154406781775922

Kim, J., Lee, S., & Marschke, G. (2014). Impact of university scientists on innovations in nanotechnology. In S. Ahn, B. Hall, & K. Lee (Eds.), Intellectual Property for Economic Development (pp. 141–158). Cheltenham: Edward Elgar Publishing. http://www.elgaronline.com/view/9781782548041.00012.xml, DOI: https://doi.org/10.4337/9781782548058.00012

Kleinknecht, A., Van Montfort, K., & Brouwer, E. (2002). The non-trivial choice between innovation indicators. Economics of Innovation and New Technology, 11(2), 109–121. DOI: https://doi.org/10.1080/10438590210899

Klette, T. J., Møen, J., & Griliches, Z. (2000). Do subsidies to commercial R&D reduce market failures? Microeconometric evaluation studies. Research Policy, 29(4), 471–495. DOI: https://doi.org/10.1016/S0048-7333(99)00086-4

Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Thousand Oaks, CA: Sage Publications.

Laursen, K., & Salter, A. (2006). Open for innovation: The role of openness in explaining innovation performance among U.K. manufacturing firms. Strategic Management Journal, 27(2), 131–150. DOI: https://doi.org/10.1002/smj.507

Lee, C.-J., Lee, S., Jhon, M. S., & Shin, J. (2013). Factors influencing nanotechnology commercialization: An empirical analysis of nanotechnology firms in South Korea. Journal of Nanoparticle Research, 15(2), 1444. DOI: https://doi.org/10.1007/s11051-013-1444-5

Little, R. J. A. (1986). Survey nonresponse adjustments for estimates of means. International Statistical Review/Revue Internationale de Statistique, 54(2), 139–157. DOI: https://doi.org/10.2307/1403140

Lugtig, P. (2017). The relative size of measurement error and attrition error in a panel survey. Comparing them with new multi-trait multi-method model. Survey Research Methods, 11(4), 369–382. DOI: https://doi.org/10.18148/srm/2017.v11i4.7170

Maas, C. J., Lensvelt-Mulders, G. J., & Hox, J. J. (2009). A multilevel multitrait-multimethod analysis. Methodology, 5(3), 72–77. DOI: https://doi.org/10.1027/1614-2241.5.3.72

Mazzoleni, R., & Nelson, R. R. (1998). The benefits and costs of strong patent protection: A contribution to the current debate. Research Policy, 27(3), 273–284. DOI: https://doi.org/10.1016/S0048-7333(98)00048-1

McNeil, R. D., Lowe, J., Mastroianni, T., Cronin, J., & Ferk, D. (2007). Barriers to nanotechnology commercialization (pp. 1–57). College of Business and Management, The University of Illinois at Springfield. http://www.wimb.fink.rs/docs/Report-BarriersNanotechnologyCommercialization.pdf

Merges, R. P. (1999). Institutions for intellectual property transactions: The case of patent pools. University of California at Berkeley Working Paper, 1–74.

Meuleman, M., & De Maeseneire, W. (2012). Do R&D subsidies affect SMEs’ access to external financing? Research Policy, 41(3), 580–591. DOI: https://doi.org/10.1016/j.respol.2012.01.001

Michie, J. (1998). Introduction. The Internationalisation of the Innovation Process. International Journal of the Economics of Business, 5(3), 261–277. DOI: https://doi.org/10.1080/13571519884387

Miner, G., Elder, J., Fast, A., Hill, T., Nisbet, R., & Delen, D. (2012). Practical text mining and statistical analysis for non-structured text data applications. New York: Academic Press. DOI: https://doi.org/10.1016/B978-0-12-386979-1.00020-7, https://doi.org/10.1016/B978-0-12-386979-1.00026-8

Minguillo, D., & Thelwall, M. (2012). Mapping the network structure of science parks: An exploratory study of cross-sectoral interactions reflected on the web. Aslib Proceedings, 64(4), 332–357. DOI: https://doi.org/10.1108/00012531211244716

National Nanotechnology Coordination Office. (2017). Supplement to the President’s 2018 Budget (p. 86).

Nelson, P. R. C., Taylor, P. A., & MacGregor, J. F. (1996). Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35(1), 45–65. DOI: https://doi.org/10.1016/S0169-7439(96)00007-X

OECD & Statistical Office of the European Communities. (2005). Oslo Manual. https://www.oecd-ilibrary.org/content/publication/9789264013100-en

OECD & Eurostat. (2019). Oslo Manual 2018. https://www.oecd-ilibrary.org/content/publication/9789264304604-en

Ortiz de Guinea, A., Titah, R., & Léger, P.-M. (2013). Measure for measure: A two study multi-trait multi-method investigation of construct validity in IS research. Computers in Human Behavior, 29(3), 833–844. DOI: https://doi.org/10.1016/j.chb.2012.12.009

Parker, H. (2000). Interfirm collaboration and the new product development process. Industrial Management & Data Systems, 100(6), 255–260. DOI: https://doi.org/10.1108/02635570010301179

Parthasarthy, R., & Hammond, J. (2002). Product innovation input and outcome: Moderating effects of the innovation process. Journal of Engineering and Technology Management, 19(1), 75–91. DOI: https://doi.org/10.1016/S0923-4748(01)00047-9

Pavitt, K. (1985). Patent statistics as indicators of innovative activities: Possibilities and problems. Scientometrics, 7(1–2), 77–99. DOI: https://doi.org/10.1007/BF02020142

Peter, J. P., & Churchill, G. A. (1986). Relationships among research design choices and psychometric properties of rating scales: A meta-analysis. Journal of Marketing Research, 23(1), 1–10. DOI: https://doi.org/10.2307/3151771

Petter, S., Straub, D., & Rai, A. (2007). Specifying formative constructs in information systems research. MIS Quarterly, 31(4), 623–656. DOI: https://doi.org/10.2307/25148814

Price, L., & Thelwall, M. (2005). The clustering power of low frequency words in academic webs. Journal of the American Society for Information Science and Technology, 56(8), 883–888. DOI: https://doi.org/10.1002/asi.20177

Ramdani, A. (2014). Revue systématique de la littérature sur les mesures de la collaboration inter-organisationnelle dans un contexte d’innovation [Masters, École Polytechnique de Montréal]. https://publications.polymtl.ca/1624/

Reinig, B. A., Briggs, R. O., & Nunamaker, J. F. (2007). On the measurement of ideation quality. Journal of Management Information Systems, 23(4), 143–161. DOI: https://doi.org/10.2753/MIS0742-1222230407

Richardson, H. A., Simmering, M. J., & Sturman, M. C. (2009). A tale of three perspectives: Examining post hoc statistical techniques for detection and correction of common method variance. Organizational Research Methods, 12(4), 762–800. DOI: https://doi.org/10.1177/1094428109332834

Rivette, K. G., & Kline, D. (2000). Rembrandts in the attic: Unlocking the hidden value of patents. Boston, MA: Harvard Business School Press.

Roja, A. I., & Nastase, M. (2013). Leveraging organizational capabilities through collaboration and collaborative competitive advantage. Revista de Management Comparat International, 14(3), 359–366.

Särndal, C. E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. New York: Springer. DOI: https://doi.org/10.1007/978-1-4612-4378-6

Straub, D., & Burton-Jones, A. (2007). Veni, vidi, vici: Breaking the TAM logjam. Journal of the Association for Information Systems; Atlanta, 8(4), 223–229. DOI: https://doi.org/10.17705/1jais.00124

Straub, D., Limayem, M., & Karahanna-Evaristo, E. (1995). Measuring system usage: Implications for IS theory testing. Management Science, 41(8), 1328–1342. DOI: https://doi.org/10.1287/mnsc.41.8.1328

Stuart, D., & Thelwall, M. (2006). Investigating triple helix relationships using URL citations: A case study of the UK West Midlands automobile industry. Research Evaluation, 15(2), 97–106. DOI: https://doi.org/10.3152/147154406781775968

Teece, D. J. (1986). Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy. Research Policy, 15(6), 285–305. DOI: https://doi.org/10.1016/0048-7333(86)90027-2

Thelwall, M. (2009). Introduction to webometrics: Quantitative web research for the social sciences. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1), 1–116. DOI: https://doi.org/10.2200/S00176ED1V01Y200903ICR004

Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406–418. DOI: https://doi.org/10.1002/asi.21462

Thomsen, I. (1973). A note on the efficiency of weighting subclass means to reduce the effects of nonresponse when analyzing survey data. Statistisk Tidskrift, 4, 278–283.

Van de Lei, T. E., & Cunningham, S. W. (2006). Use of the internet for future-oriented technology analysis. 2nd International Seville Seminar on Future-Oriented Technology Analysis: Impact of FTA Approaches on Policy and Decision-Making (pp. 28–29). Seville, Spain.

Vaughan, L. (2004). Exploring website features for business information. Scientometrics, 61(3), 467–477. DOI: https://doi.org/10.1023/b:scie.0000045122.93018.2a

Weare, C., & Lin, W.-Y. (2000). Content analysis of the world wide web: Opportunities and challenges. Social Science Computer Review, 18(3), 272–292. DOI: https://doi.org/10.1177/089443930001800304

Youtie, J., Hicks, D., Shapira, P., & Horsely, T. (2012). Pathways from discovery to commercialization: Using web sources to track small and medium-sized enterprise strategies in emerging nanotechnologies. Technology Analysis & Strategic Management, 24(10), 981–995. DOI: https://doi.org/10.1080/09537325.2012.724163

APPENDIX 1: RELEVANT QUESTIONS FROM THE QUESTIONNAIRE-BASED SURVEY

R&D

1. How many nanotechnology-related and/or advanced material products in development do you
actually have in each of the following phases?

• Applied Research;
• Product Scoping and Business Case Building;
• Development, Testing and Validation;
• Commercialisation.

2. How important to your plant’s innovation activities are each of the following sources
of knowledge and innovation? (1–Not important, 2–Very low, 3–Low, 5–High, 6–Very
high, 7–Essential).

• Internal R&D in your firm;
• Commercial laboratories/R&D firms/Technical Consultants.

3. Please indicate the level of importance of each of the following innovation activities to
your plant during the period 2010 to 2014 (1–Not important, 2–Very low, 3–Low, 5–High,
6–Very high, 7–Essential).

• Contracting of external R&D service providers;
• Providing R&D services to third parties.

4. How long did it take to develop your most significant and recent (MSR) nanotechnology-