Technical Glossary

Glossary of technical terms used to describe searching in SearchWorks. Please see How Searching Works for additional explanations on how these concepts affect relevancy.

  • Blacklight - discovery interface software. The Blacklight software can be used for any Solr index and is designed to be easily customized. It is community-written and maintained as an open source project, and Stanford DLSS programmers have been active participants in creating and shaping the code. See https://projectblacklight.org/.
  • boolean - a search syntax using AND, OR, and NOT as logical operators. NOT works in all contexts in SearchWorks, while AND and OR work in Advanced Search only. The boolean operators must be entered in all capital letters.
  • document - the unit of data that the search engine works with. Note that the original data is not the same as the data in the search engine; the latter will contain metadata specific to the search engine and may or may not contain the original data in its entirety. A document can include full text, images, or other data. Searches return matches to individual documents. Synonym for "record."
  • facet - a field used to filter results, often appearing in a navigation bar on the left side of the page. For example, in SearchWorks, the Format facet has values Book, Image, etc.
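    As an illustration, facet values are whole-string labels whose document counts drive the filter display; here is a minimal Python sketch (the sample documents and the "format" field name are hypothetical, not SearchWorks data):
      from collections import Counter

      # Hypothetical sample documents, each carrying a "format" facet field.
      docs = [
          {"title": "Wooden boats", "format": "Book"},
          {"title": "Boat hulls", "format": "Image"},
          {"title": "Sailing", "format": "Book"},
      ]

      # Count whole facet values, then filter the result set on one of them.
      facet_counts = Counter(d["format"] for d in docs)      # {'Book': 2, 'Image': 1}
      filtered = [d for d in docs if d["format"] == "Book"]  # keeps the two Book documents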
  • field - fields contain information to be used by the search engine in particular contexts. SearchWorks has fields like publication date, format, multiple kinds of title fields, etc. Fields generally have specific purposes, such as searching or sorting (the field types below are illustrated in the sketch after this list).
    • search field - contains information massaged for optimal search behaviors. It might have individual words for searching rather than looking for exact matches on a text string. Individual words might be stemmed so both car and cars would match a search for either.
    • sort field - might contain text that skips non-filing characters and that is all lower case. It might contain dates normalized for sorting to all be in the format YYYYMMDD. Sort field values aren't displayed to end users.
    • facet field - facet values are generally whole strings. A facet would have "wooden boat" as a value, rather than splitting it into two values "wooden" and "boat."
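    A minimal Python sketch of how one source record might feed differently prepared search, sort, and facet fields (the sample record and the processing rules are illustrative assumptions, not SearchWorks indexing code):
      record = {"title": "The Wooden Boat", "date": "7/19/2011"}

      # search field: lower-cased individual words, ready for word-level matching
      title_search = record["title"].lower().split()              # ['the', 'wooden', 'boat']

      # sort field: skip the non-filing article "The" and lower-case the rest
      title_sort = record["title"].removeprefix("The ").lower()   # 'wooden boat'

      # facet field: keep the whole string as a single value
      title_facet = record["title"]                                # 'The Wooden Boat'

      # date sort field: normalized to YYYYMMDD
      m, d, y = record["date"].split("/")
      date_sort = f"{y}{int(m):02d}{int(d):02d}"                   # '20110719'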
  • fielded search - a type of search in which only a single field or subset of fields is used for matching. For example, a title search would not look for matches in author or subject fields. An ISBN search looks for matches only in an ISBN field.
  • index - the index is what is searched to produce search results. Indexes are generally text stored in a special way to allow for very fast searching. The specific text in the index is determined by defining fields.
  • indexed - if text has been indexed, then it can be searched. If text is not indexed, then it will never match query terms.
  • Lucene - search engine software. It is a mature, robust piece of open source code, and is widely used in many different contexts (e.g. Netflix, CNet, AOL, Ticketmaster ...). Lucene is the core of the Solr software; programs interact with it directly (not via HTTP). See https://lucene.apache.org/.
  • mm - "Minimum Must Match" - affects what is considered a hit.
    • a way to indicate how many of the terms in a query must match a record for it to be included in the results. See How Searching Works for further explanation.
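    A rough Python sketch of the minimum-must-match idea, assuming terms are already tokenized and matching is exact word overlap (real Solr mm settings are more expressive, allowing percentages and conditional rules):
      def meets_mm(query_terms, doc_terms, mm):
          """Return True if at least mm of the query terms occur in the document."""
          matched = sum(1 for t in query_terms if t in doc_terms)
          return matched >= mm

      doc = {"french", "beans", "make", "food", "scares"}
      meets_mm(["french", "beans", "prices"], doc, mm=2)  # True: 2 of 3 terms match
      meets_mm(["french", "beans", "prices"], doc, mm=3)  # False: only 2 of 3 match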
  • normalize - making data easier to compare by making it more similar. For example, dates represented as "7/19/2011" and "2010/5/18" might be normalized to "2011-07-19" and "2010-05-18" respectively. Or the "titleInfo" field in a MODS record might become the "title" field in a Solr index to allow comparison with Dublin Core records in the index.
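    A small Python sketch of the date normalization described above, assuming the inputs are month/day/year and year/month/day respectively:
      from datetime import datetime

      datetime.strptime("7/19/2011", "%m/%d/%Y").strftime("%Y-%m-%d")   # '2011-07-19'
      datetime.strptime("2010/5/18", "%Y/%m/%d").strftime("%Y-%m-%d")   # '2010-05-18'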
  • precision - a measurement of search result set accuracy. Perfect precision means all the retrieved documents match the query, but says nothing about the number of matching documents missing from the search results (see the sketch after the recall entry below).
  • query - what the search engine is being asked to match. A query is sent to the search engine, and search results are returned.
  • recall - a measurement of search result set completeness. Perfect recall means all the matching documents were retrieved, but says nothing about the number of irrelevant documents in the results.
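    Precision and recall are simple ratios; a Python sketch with hypothetical document id sets:
      retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the search returned
      relevant  = {"doc2", "doc3", "doc5"}           # what actually matches the information need

      true_hits = retrieved & relevant
      precision = len(true_hits) / len(retrieved)    # 2/4 = 0.5
      recall    = len(true_hits) / len(relevant)     # 2/3 ≈ 0.67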
  • record - see document. "Document" is the Lucene/Solr preferred term, while "record" is the preferred term in the context of Blacklight or other Ruby on Rails applications, as well as in the database context.
  • relevance - closely related to precision, this is generally a numeric representation of how well a document matched the query.
  • relevance ranking - given the numeric representation of relevancy, results can be ordered by their relevancy score. The relevancy ranking is the ordinal representation of relevancy: best match, next best match, etc.
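    A minimal Python sketch of turning relevancy scores into a ranking (the documents and scores are hypothetical):
      results = [("doc A", 0.42), ("doc B", 0.91), ("doc C", 0.17)]
      ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
      # [('doc B', 0.91), ('doc A', 0.42), ('doc C', 0.17)] -> best match first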
  • results - the set of documents that match the query. Synonym for hits.
  • search engine - software used to take a bunch of data (usually text and/or text metadata), make it searchable, and respond to queries with lists of results from the original data.
  • slop - the distance allowed between consecutive query terms (illustrated in the sketch after the sub-entries below).
    • query slop - affects whether or not the document is in the search results
      • query slop applies only when there is a phrase (in quotes) in the query
      • For a phrase in the query, this is the distance that can separate the query terms.
      • with a setting of 1, the query: "french beans food scares" (with quotes) would match a document containing "french beans make food scares" but would not match "french beans can make food scares"
      • our setting is 1.
    • phrase slop - affects how high the document is in a set of search results.
      • like query slop, but it only affects the relevancy sorting of the matching documents.
      • phrase slop (Solr's "ps" parameter) applies to ALL result sets.
      • our setting is 0.
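    A simplified Python sketch of the slop check for a quoted phrase, treating slop as the number of extra terms allowed inside the matched span (real Lucene slop counts term-position moves; this is not SearchWorks code):
      def phrase_matches(phrase, text, slop):
          """Very simplified: the phrase terms must appear in order, and the number of
          extra words inside the matched span must not exceed `slop`."""
          words = text.split()
          phrase_terms = phrase.split()
          for start, word in enumerate(words):
              if word != phrase_terms[0]:
                  continue
              pos, ok = start, True
              for term in phrase_terms[1:]:
                  try:
                      pos = words.index(term, pos + 1)
                  except ValueError:
                      ok = False
                      break
              if ok and (pos - start + 1) - len(phrase_terms) <= slop:
                  return True
          return False

      phrase_matches("french beans food scares", "french beans make food scares", slop=1)      # True
      phrase_matches("french beans food scares", "french beans can make food scares", slop=1)  # False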
  • Solr - search engine software. It is a mature, robust piece of open source code, and is widely used in many different contexts (e.g. Netflix, CNet, AOL, Ticketmaster ...). It is built on top of the Lucene software, providing programmers with easier interactions with Lucene (e.g. HTTP instead of Java). See https://lucene.apache.org/solr.
  • stemming - a way of reducing words to their root (or stem or base) to promote more matches. For example, "riding" and "rides" might both be reduced to "ride" so all of these word variants would match each other. Note that the stemmed text might not be a word, as it is only meant to be used by the search engine. There are different algorithms for stemming, and stemming is different for different languages. Stemming is different from truncation because the stem might not be the exact letters at the beginning of the word, for example, "happy" might stem to "happi" so "happy" and "happiness" would match.
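    As a toy illustration of suffix-stripping (real stemmers such as the Porter algorithm apply many more rules; this tiny Python sketch is not how Solr actually stems):
      def naive_stem(word):
          """Strip one of a few common English suffixes; the stem need not be a real word."""
          for suffix in ("iness", "ness", "ing", "es", "s", "y"):
              if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                  return word[: -len(suffix)]
          return word

      naive_stem("riding")     # 'rid'
      naive_stem("rides")      # 'rid'   -- the two variants now share a stem
      naive_stem("happy")      # 'happ'
      naive_stem("happiness")  # 'happ'  -- so "happy" matches "happiness"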
  • stopword - a word that is not indexed because it occurs so often in the text that it has no effect on search results, e.g. "the" "a" in English. Stopwords are different for different languages. For example, "die" could be a stopword in German, but not in English.
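    A minimal Python sketch of dropping stopwords at index time (the stopword list here is a tiny illustrative subset, not the one SearchWorks uses):
      STOPWORDS = {"a", "an", "the", "of", "and"}

      tokens = "a comparison of Socrates and SearchWorks".lower().split()
      indexed_terms = [t for t in tokens if t not in STOPWORDS]
      # ['comparison', 'socrates', 'searchworks']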
  • synonyms - strings that are to be considered equivalent by the search engine, such as "ILS" and "Integrated Library System". SearchWorks does not use these at this time.
  • term - in a query string, the terms are generally separated by whitespace; they are essentially "words." In the search engine, a term is a unique, searchable "word" in the index that may occur in many different documents. We can look at "all the terms in the ____ field" in the index or "all the terms in the index" to see the unique words used in these contexts for all of our documents.
  • token - when text is "tokenized", it is split up into small pieces, skipping what the search engine would consider insignificant characters, such as most whitespace. The tokens are the pieces that aren't skipped. So "a comparison of Socrates and SearchWorks" would have the tokens "a", "comparison", "of", "Socrates", "and", "SearchWorks." Note that these are NOT the exact strings that will go into the index: usually tokenizing is the first step of processing text to put it in the index.
  • tokenize - a string may be split up into "tokens" during indexing. This is most commonly done when a string in the document is being parsed and otherwise manipulated to optimize searching. The tokens are usually temporary artifacts as text is changed from long strings into terms in fields in the index. In English and many other languages, tokens are generally separated by non-alphanumeric characters such as whitespace and punctuation; these non-alphanumeric characters would be considered insignificant by the search engine and would be ignored. But it's not quite that simple. Some examples: a decimal number such as 3.14 should be viewed as a single token, while the hyphenated word "power-shot" could become "power" and "shot" or "powershot" or remain "power-shot."
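    A simplified Python tokenizer along these lines, assuming whitespace and most punctuation are insignificant but keeping decimal numbers such as 3.14 whole (a sketch, not the actual Solr analysis chain):
      import re

      def tokenize(text):
          """Split on insignificant characters; keep decimal numbers together."""
          return re.findall(r"\d+(?:\.\d+)?|\w+", text)

      tokenize("a comparison of Socrates and SearchWorks")
      # ['a', 'comparison', 'of', 'Socrates', 'and', 'SearchWorks']
      tokenize("pi is about 3.14, power-shot cameras")
      # ['pi', 'is', 'about', '3.14', 'power', 'shot', 'cameras']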
  • truncation - lopping the end off of a string, generally to promote more matches. This is different from stemming because it will never change any letters in the remaining string, while stemming uses rules that can change letters, e.g. "happy" could become "happi" to match "happy" with "happiness." With truncation, "happy" does not match "happiness", and "happ" matches "happiness" but also matches "happen."
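    In Python terms, truncation is just a prefix test, in contrast to the letter-changing rules of stemming (a sketch of the idea, not SearchWorks behavior):
      truncated = "happ"   # "happy" with the end lopped off

      "happiness".startswith(truncated)   # True  -- matches
      "happen".startswith(truncated)      # True  -- also matches, perhaps unintentionally
      "happiness".startswith("happy")     # False -- untruncated "happy" does not match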
  • wildcards - a way of broadening the matches on a query by using characters such as * to represent any character string. For example, "woo*" would match "wood", "wool" and "woolen." This type of searching is not currently available in SearchWorks.
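    The idea can be sketched in Python with the standard fnmatch module, where * stands for any run of characters (again, SearchWorks itself does not offer this):
      from fnmatch import fnmatch

      [w for w in ("wood", "wool", "woolen", "wonder") if fnmatch(w, "woo*")]
      # ['wood', 'wool', 'woolen']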