This article will briefly introduce Whoosh, a lightweight search tool in Python, and give sample code for its use.
Whoosh Introduction
Created by Matt Chaput, Whoosh started as a simple, fast search service for the online documentation of the Houdini 3D animation package, and has since grown into a full-fledged, open-source search solution.
Whoosh is a flexible, convenient, lightweight search engine library written purely in Python. It supports both Python 2 and 3 and offers the following advantages:
- Whoosh is written purely in Python, yet it is fast; it needs only a Python environment and no compiler.
- It uses the Okapi BM25F ranking algorithm by default, and other ranking algorithms are also supported.
- Whoosh creates fairly small index files compared to many other search libraries.
- All text indexed by Whoosh must be unicode.
- Whoosh can store arbitrary Python objects alongside indexed documents.
Compared to established search engines such as Elasticsearch or Solr, Whoosh is lighter and simpler to operate, and is worth considering for small search projects.
Whoosh Index & Query
For readers familiar with Elasticsearch (ES), the two key aspects of search are mapping and query, that is, building the index and querying it; behind them sit complex index storage, query parsing, and ranking algorithms. If you have experience with ES, getting started with Whoosh is easy.
Based on my understanding and the official Whoosh documentation, introductory use of Whoosh comes down to two things: index and query. One of the most powerful features of a search engine is full-text search, which depends on ranking algorithms such as BM25 and on how fields are stored. As a noun, index means the data structure built over a field; as a verb, it means building that structure. A query takes the search statement, matches it against the index, and runs the matches through a ranking algorithm to return reasonable results.
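To make the index/query split concrete, here is a minimal plain-Python sketch of an inverted index with BM25-style ranking. It is a conceptual illustration only, not Whoosh's implementation: the toy documents, the whitespace tokenizer, and the function names `build_index` and `bm25_search` are all assumptions made for this example, and `k1=1.2`, `b=0.75` are simply commonly used BM25 parameter values.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; document ids and texts are made up for illustration.
DOCS = {
    "d1": "to be or not to be that is the question",
    "d2": "the lady doth protest too much",
    "d3": "brevity is the soul of wit",
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def build_index(docs):
    """Index (verb): map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Query: score every matching document with BM25, best match first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Inverse document frequency: rarer terms contribute more.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: long documents are penalized slightly.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index, lengths = build_index(DOCS)
results = bm25_search("the question", index, lengths)
```

All three documents contain "the", but only `d1` contains the rarer term "question", so `d1` ranks first. That is the sense in which ranking, not just matching, shapes the search results.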
Detailed instructions on the use of Whoosh have been given in the official documentation, so I will just give a simple example here to illustrate how Whoosh can easily enhance our search experience.
Sample Code
The sample data for this project is poem.csv
title | dynasty | poet | content |
Hamlet | Tudor dynasty | William Shakespeare | Hamlet... |
Fields
Based on the characteristics of the dataset, we create four fields: title, dynasty, poet, and content. The code to create them is as follows:
# -*- coding: utf-8 -*-
import os
import json
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer

# Create a schema; stored=True means the field value can be returned with search results
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer()),
)
Here, an ID field holds a single unit value that is indexed as a whole and is not split into separate words; it is commonly used for file paths, URLs, dates, and categories.
A TEXT field holds body text: the text is analyzed into words and indexed (and stored, when stored=True), which makes word-level search possible.
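The difference between the two field types can be illustrated with a small plain-Python sketch. It does not call Whoosh; the helpers `id_terms` and `text_terms` are hypothetical names invented here, and a real TEXT analyzer typically does more than lowercase and split on whitespace.

```python
def id_terms(value):
    """ID-style field: the whole value is indexed as one term."""
    return [value]


def text_terms(value):
    """TEXT-style field: the value is broken into lowercase word terms."""
    return value.lower().split()


poet = "William Shakespeare"
id_index = set(id_terms(poet))      # one term: the full, unchanged value
text_index = set(text_terms(poet))  # two terms: one per word

# An ID-style field only matches the exact, complete value.
id_matches_word = "shakespeare" in id_index
id_matches_exact = "William Shakespeare" in id_index

# A TEXT-style field matches individual words.
text_matches_word = "shakespeare" in text_index
```

Under these assumptions, searching the ID-style index for a single word finds nothing, which is why fields meant for word search, such as title and content, are declared as TEXT in the schema.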
Create index file
Next, we need to create the index files. The program parses the poem.csv file, converts its rows into index entries, and writes them to the indexdir directory. The Python code is as follows:
# Parse the poem.csv file, keeping only lines with exactly four comma-separated fields
with open('poem.csv', 'r', encoding='utf-8') as f:
    texts = [line.strip().split(',') for line in f if len(line.strip().split(',')) == 4]
# Create the indexdir directory if needed and store the schema information in it
indexdir = 'indexdir/'
if not os.path.exists(indexdir):
    os.mkdir(indexdir)
ix = create_in(indexdir, schema)
# Add the documents to be indexed according to the schema definition
writer = ix.writer()
for i in range(1, len(texts)):  # start at 1 to skip the header row
    title, dynasty, poet, content = texts[i]
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()
Once the index has been created successfully, the indexdir directory exists and contains the index files for the fields of the poem.csv data above.
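One caveat about the parsing step: splitting each line on commas drops any row whose content itself contains a comma, because that row no longer has exactly four parts. A hedged alternative is the standard library's `csv` module, which understands quoted fields. The sketch below uses an in-memory stand-in for poem.csv; the quoted sample row is hypothetical and assumes the file quotes fields that contain commas.

```python
import csv
import io

# Hypothetical in-memory stand-in for poem.csv; the quoted content holds a comma.
line = 'Hamlet,Tudor dynasty,William Shakespeare,"To be, or not to be"'
sample = io.StringIO('title,dynasty,poet,content\n' + line + '\n')

# Naive splitting yields five parts, so the len(...) == 4 filter would drop the row.
naive = line.split(',')

# DictReader uses the header row for keys and respects the quoting.
# Extra fields land under the key None and missing fields get the value None,
# so both checks together keep only rows with exactly four fields.
reader = csv.DictReader(sample)
rows = [row for row in reader if None not in row and None not in row.values()]
```

Because the dictionary keys match the schema field names, each row could then be passed along as `writer.add_document(**row)`.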
Query
Once the index has been created successfully, we can use it for querying.
For example, to find documents that have Hamlet in the content field, we can use the following code:
# Create a searcher
searcher = ix.searcher()
# Retrieve documents with 'Hamlet' in the content
results = searcher.find("content", "Hamlet")
print('A total of %d documents were found.' % len(results))
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))
The output results are as follows:
{"content": "Hamlet...", "dynasty": "Tudor dynasty", "poet": "William Shakespeare", "title": "Hamlet"}
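The query code passes `ensure_ascii=False` to `json.dumps`. With English data this makes no visible difference, but it matters when the stored fields contain non-ASCII text such as Chinese. The short sketch below shows the effect; the record is a hypothetical example and is not taken from poem.csv.

```python
import json

# Hypothetical record; the Chinese values are only here to show the encoding difference.
record = {"title": "静夜思", "poet": "李白"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are kept as-is, which is easier to read.
readable = json.dumps(record, ensure_ascii=False)
```

Both strings decode back to the same dictionary; only the printed form differs.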