[Discovered Method] AI-Powered Semantic Search for Your Local Files

Hello everyone,

Prepare for a deep dive. What I’m about to share is not a simple trick; it’s a blueprint for building a tool that will fundamentally change how you interact with your computer. This method allows you to find any file—image, document, or code—not by its name, but by its meaning.

The Problem: You have a PDF of a contract, but you can’t remember if it’s named contract_final.pdf, agreement_v3.pdf, or some_scan_123.pdf. Or you’re looking for a specific photo of a sunset over a beach, buried somewhere in thousands of files. Standard search is useless.

The Solution: We will build a Personal Semantic Index. This is a local database that maps every file on your computer to a rich, descriptive summary generated by a multimodal AI. You then search this index using natural language.

This is not a theoretical concept. This is a practical guide to building it yourself.

The Architecture: A Two-Part System

Our system has two components:

  1. The Indexer (indexer.py): A Python script that crawls specified directories. For each file, it sends the content (or a representation of it) to an AI and asks, “What is this?” It then stores the file path and the AI’s descriptive answer in a local database.
  2. The Searcher (searcher.py): A second script that takes your plain English query (e.g., “find the presentation about Q3 marketing results”), and searches the local database for the most relevant descriptions, returning the exact file paths.
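
Concretely, each entry the Indexer writes is just a file path paired with a description; the path and wording below are a made-up illustration of what one indexed record might look like:

```python
# Hypothetical example of a single indexed record: the absolute file path
# plus the AI-generated description that the Searcher will later match against.
record = (
    "/home/you/Documents/some_scan_123.pdf",
    "A scanned legal agreement for a residential property purchase, "
    "signed by two parties and dated March 2024.",
)
```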

Prerequisites:

  1. Python 3: With pip installed.
  2. An AI API Key: You’ll need an API key from a provider with a powerful multimodal model. Google AI for Developers (for the Gemini models) is perfect for this, as it handles both text and images effectively.
  3. Required Python Libraries: Install them via pip: `pip install google-generativeai python-dotenv Pillow`

Part 1: The Indexer Script

This script is the heart of our system. It populates our semantic database.

Setup:

  1. Create a file named .env and put your API key in it: API_KEY='YOUR_GOOGLE_AI_API_KEY'
  2. Save the following code as indexer.py:

```python
import os
import sqlite3
import google.generativeai as genai
from dotenv import load_dotenv
from PIL import Image
import mimetypes

# --- Configuration ---
load_dotenv()
genai.configure(api_key=os.getenv("API_KEY"))
DB_FILE = "file_index.db"

# IMPORTANT: Add the paths to the directories you want to index here.
# Be careful with this! Do not index your entire C: drive on the first run.
# Start with a small, specific folder like your Documents or Pictures.
DIRECTORIES_TO_INDEX = [os.path.expanduser("~/Documents"), os.path.expanduser("~/Pictures")]

# --- AI Model Setup ---
model = genai.GenerativeModel('gemini-1.5-flash')

# --- Database Setup ---
def setup_database():
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS files (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            path TEXT NOT NULL UNIQUE,
            description TEXT NOT NULL,
            indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    conn.commit()
    return conn

# --- AI Description Generation ---
def get_ai_description(file_path):
    try:
        mime_type, _ = mimetypes.guess_type(file_path)
        if mime_type is None:
            return "Unknown file type."

        print(f"  -> Analyzing ({mime_type})...")

        # Handle images: send the image itself to the multimodal model
        if mime_type.startswith('image/'):
            img = Image.open(file_path)
            prompt = "Describe this image in detail. What is the subject, setting, mood, and content? Mention any text visible."
            response = model.generate_content([prompt, img])
            return response.text

        # Handle text-based files. Note: PDFs and Word documents are read here as
        # raw text, which only recovers fragments; for best results stick to plain-text
        # formats or plug in a proper extractor for those types.
        elif mime_type.startswith('text/') or mime_type in ['application/pdf', 'application/msword', 'application/json', 'application/javascript']:
            with open(file_path, 'r', errors='ignore') as f:
                content = f.read(5000)  # Read first 5000 chars to avoid overloading
                if not content.strip():
                    return "File is empty or contains no readable text."
                prompt = f"Summarize the content and purpose of this document excerpt. Identify key themes, people, or topics:\n\n---\n{content}\n---"
                response = model.generate_content(prompt)
                return response.text
        else:
            return f"File of unhandled type: {mime_type}"

    except Exception as e:
        return f"Error analyzing file: {e}"

# --- Main Indexing Logic ---
def index_files(conn):
    cursor = conn.cursor()
    for directory in DIRECTORIES_TO_INDEX:
        print(f"Crawling directory: {directory}")
        for root, _, files in os.walk(directory):
            for file in files:
                file_path = os.path.join(root, file)

                # Skip files that have already been indexed
                cursor.execute("SELECT id FROM files WHERE path = ?", (file_path,))
                if cursor.fetchone():
                    continue

                print(f"Found new file: {file_path}")
                description = get_ai_description(file_path)

                # Save the new description to the database
                cursor.execute("INSERT INTO files (path, description) VALUES (?, ?)", (file_path, description))
                conn.commit()
                print(f"  -> Indexed with description: {description[:80]}...")

if __name__ == "__main__":
    db_conn = setup_database()
    index_files(db_conn)
    db_conn.close()
    print("\nIndexing complete.")
```

Part 2: The Searcher Script

This is your new search interface. Save this code as searcher.py:

```python
import sqlite3
import argparse

DB_FILE = "file_index.db"

def search_index(query):
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()

    # We use a simple LIKE query here. The real power comes from the
    # richness of the AI-generated descriptions we are searching through.
    search_term = f"%{query}%"

    cursor.execute("SELECT path, description FROM files WHERE description LIKE ?", (search_term,))

    results = cursor.fetchall()
    conn.close()

    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Search for files using AI-generated semantic descriptions.")
    parser.add_argument("query", type=str, help="Your natural language search query.")
    args = parser.parse_args()

    search_results = search_index(args.query)

    if not search_results:
        print("No matching files found.")
    else:
        print(f"Found {len(search_results)} matching file(s) for '{args.query}':\n")
        for path, desc in search_results:
            print(f"File: {path}")
            print(f"  AI Description: {desc.strip()}\n")
            print("-" * 20)
```
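
One practical refinement: the LIKE query above only matches when your whole query appears as a literal substring of a description, so long natural-language phrases can come up empty. Here is a minimal sketch of a more forgiving variant that requires every word of the query to appear somewhere in the description, in any order (same database and schema; search_index_keywords is a new name, not part of the scripts above):

```python
def search_index_keywords(query):
    """Match descriptions containing every word of the query, in any order."""
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()

    # Drop very short filler words ("a", "of", "the") so they don't dilute the match;
    # fall back to the raw words if nothing is left.
    words = [w for w in query.lower().split() if len(w) > 2] or query.lower().split()
    if not words:
        return []

    where_clause = " AND ".join("description LIKE ?" for _ in words)
    params = [f"%{w}%" for w in words]

    cursor.execute(f"SELECT path, description FROM files WHERE {where_clause}", params)
    results = cursor.fetchall()
    conn.close()
    return results
```

Swap it in for search_index in searcher.py if you find yourself typing longer, sentence-like queries.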

How To Use Your New Superpower

  1. Run the Indexer: Open your terminal and run python indexer.py. This will take time, especially on the first run, as it needs to analyze files and make API calls. Let it run in the background. You can run it periodically to index new files.
  2. Search Your Files: Now, for the magic. In your terminal, simply type:

```bash
# To find that contract
python searcher.py "legal agreement for the house purchase"

# To find that sunset photo
python searcher.py "a photo of a dramatic sunset on a sandy beach"

# To find a specific piece of code
python searcher.py "python script that uses sqlite"
```

Because you are searching the AI's rich descriptions rather than filenames, the results reflect what each file is about, not what it happens to be called. (With the basic LIKE search, shorter keyword-style queries match most reliably; the keyword variant sketched after Part 2 handles longer phrases.) This is a fundamentally new and powerful way to interact with your own data. This isn't just a trick; it's a new paradigm for personal file management. Enjoy.
