This article briefly introduces Whoosh, a lightweight search library for Python, and gives sample code for using it.

Whoosh Introduction

  Created by Matt Chaput, Whoosh started as a simple, fast search service for the online documentation of the Houdini 3D animation package, and has since grown into a full-fledged, open-source search solution.
  Written purely in Python, Whoosh is a flexible, convenient, lightweight search engine library that supports both Python 2 and 3. Its advantages include:

  • Whoosh is written purely in Python yet is fast; it needs only a Python environment, not a compiler.

  • It ranks results with the Okapi BM25F algorithm by default, and other ranking algorithms are also supported.

  • Whoosh creates fairly small index files compared with many other search libraries.

  • All indexed text in Whoosh must be unicode.

  • Whoosh can store arbitrary Python objects alongside indexed documents.

  Compared with established search engines such as Elasticsearch or Solr, Whoosh is lighter and simpler to operate, and is worth considering for small search projects.
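
To give a feel for the default ranking mentioned in the list above, here is a minimal sketch of plain Okapi BM25 over a single field. It is not Whoosh's implementation, and it leaves out the per-field weighting that BM25F adds; the function name bm25_scores, the toy corpus, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against query_terms with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency is damped and normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, already split into words
docs = [
    "to be or not to be".split(),
    "the play is the thing".split(),
    "to sleep perchance to dream".split(),
]
scores = bm25_scores(["to", "be"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The first document matches both query terms, so it ranks highest; the second matches neither and scores zero.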

Whoosh Index & query

  For those familiar with Elasticsearch (ES), the two important aspects of search are mapping and query, that is, building the index and querying it; behind them lie index storage, query parsing, and ranking algorithms. If you have experience with ES, getting started with Whoosh is easy.
  Based on my understanding and the official Whoosh documentation, basic use of Whoosh comes down to index and query. One of the most powerful features of a search engine is full-text search, which depends on ranking algorithms such as BM25 and on how we store fields. As a noun, index means the stored index built over a field; as a verb, it means building that index. Query takes the search statement we provide and runs it through a ranking algorithm to return reasonable search results.
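
The two ideas can be illustrated without Whoosh at all. The following is a toy sketch of an inverted index in plain Python, assuming whitespace tokenization and no ranking; the names build_index and query and the sample documents are made up for illustration and are not part of Whoosh.

```python
from collections import defaultdict

def build_index(docs):
    """Index (verb): map every lowercased word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def query(index, text):
    """Query: return the sorted ids of documents that contain every word of the query."""
    words = text.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

# Hypothetical sample documents keyed by id
docs = {
    1: "to be or not to be",
    2: "the play is the thing",
    3: "not the thing to be",
}
index = build_index(docs)
print(query(index, "to be"))      # ids of documents containing both "to" and "be"
print(query(index, "the thing"))  # ids of documents containing both "the" and "thing"
```

A real search engine adds analysis (tokenizing, lowercasing, stemming) when building the index and a ranking step when answering the query, which is what Whoosh provides.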
  The official documentation covers Whoosh in detail, so I will give only a simple example here to show how easily Whoosh can improve the search experience.

Sample Code

The sample data for this project is poem.csv:

title,dynasty,poet,content
Hamlet,Tudor dynasty,William Shakespeare,Hamlet...


  Based on the characteristics of the dataset, we create four fields: title, dynasty, poet, and content. The code to create them is as follows:

# -*- coding: utf-8 -*-
import os
import json
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer

# Create the schema; stored=True means the field's value is stored and can be returned in search results
schema = Schema(title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
                dynasty=ID(stored=True),
                poet=ID(stored=True),
                content=TEXT(stored=True, analyzer=ChineseAnalyzer()))

Here, an ID field indexes its entire value as a single unit rather than splitting it into words; it is commonly used for file paths, URLs, dates, and categories.
A TEXT field holds body text: the text is indexed (and optionally stored) and can be searched by individual words.

Create index file

  Next, we need to create the index files. The program parses poem.csv, converts it into an index, and writes the index to the indexdir directory. The Python code is as follows:

# Parse poem.csv, keeping only the rows that have exactly four fields
with open('poem.csv', 'r', encoding='utf-8') as f:
    rows = (line.strip().split(',') for line in f)
    texts = [row for row in rows if len(row) == 4]

# Create the indexdir directory if needed, then create the index there using the schema
indexdir = 'indexdir/'
if not os.path.exists(indexdir):
    os.mkdir(indexdir)
ix = create_in(indexdir, schema)

# Add the documents to be indexed according to the schema definition
writer = ix.writer()
for i in range(1, len(texts)):  # start at 1 to skip the header row
    title, dynasty, poet, content = texts[i]
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()  # the documents are not saved to the index until the writer is committed

Once the index has been created, the indexdir directory contains the index files for the fields of the poem.csv data above.
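
Splitting each line on commas, as above, silently drops any row whose content itself contains a comma, because such a row no longer has exactly four parts. A minimal alternative sketch uses the standard-library csv module, assuming that poem.csv quotes fields containing commas; the sample rows below are made up and stand in for the real file.

```python
import csv
import io

# Hypothetical sample standing in for poem.csv; the content field has a quoted comma
sample = io.StringIO(
    'title,dynasty,poet,content\n'
    'Hamlet,Tudor dynasty,William Shakespeare,"To be, or not to be"\n'
)

reader = csv.DictReader(sample)  # the header row supplies the field names
# Skip malformed rows: extra values land under the key None, missing ones become None
rows = [row for row in reader if None not in row and None not in row.values()]

for row in rows:
    print(row["title"], "|", row["content"])
```

With the real file, the reader would be built from open('poem.csv', newline='', encoding='utf-8'), and each row dictionary could be passed to writer.add_document(**row) since its keys match the schema field names.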


  Once the index has been created, we can query it.
  For example, to find documents whose content contains Hamlet, we can use the following code:

# Create a searcher
searcher = ix.searcher()

# Retrieve documents with 'Hamlet' in the content
results = searcher.find("content", "Hamlet")
print('A total of %d documents were found.' % len(results))
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))

The output results are as follows:

{"content": "Hamlet...", "dynasty": "Tudor dynasty", "poet": "William Shakespeare ", "title": "Hamlet"}
