Data Extraction & OCR Intern (Short-Term Project, 3–5 Weeks)
Location: Remote
Company: HouseNovel
Duration: 3–5 weeks (project-based)
Hours: 10–15 hours per week (flexible)
Compensation: $15–$25 per hour (depending on experience)
About HouseNovel
HouseNovel is a platform that helps homeowners and real estate professionals discover and share home history (think Ancestry for homes!). We are expanding our automated home history reports to include previous homeowners and are looking for an intern with data extraction and OCR (Optical Character Recognition) experience to help us digitize historical homeowner records from Hennepin County city directories.
This is a short-term, project-based internship designed to test and refine our homeowner record extraction process. If successful, there may be future opportunities for additional projects.
What You’ll Do
The primary goal of this project is to extract historical homeowner information from scanned Minneapolis City Directories (starting with 1950) and store it in a searchable database for integration into HouseNovel’s automated home history reports.
Your key responsibilities will include:
- Developing a strategy for extracting text from city directory images using OCR tools such as Tesseract, Google Vision API, AWS Textract, or another tool you recommend
- Cleaning and processing extracted text to ensure accuracy and readability
- Developing Regular Expressions (RegEx) or scripts to structure extracted data efficiently
- Storing extracted data in a PostgreSQL/MySQL database or recommending a more efficient alternative
- Collaborating with our PHP developer to integrate the extracted data into HouseNovel’s home history reports
- Testing and refining the extraction process to ensure scalability for future directories
How This Project Will Be Accomplished
While we have identified a general workflow for this project, we are open to your input on optimizing the process. If you have a more efficient approach, we encourage you to suggest it.
Here is an initial outline of the process:
Step 1: Extract Text from the Directory Pages (OCR Processing)
Since the city directories are stored as scanned images, OCR is likely required to extract the text. You will:
- Evaluate the best OCR approach (Tesseract OCR, Google Vision API, AWS Textract, or another tool you suggest)
- Run tests on sample pages to determine the most accurate method
- Post-process the OCR output (clean up formatting, remove noise, correct recognition errors)
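One possible way to compare OCR tools on sample pages is to hand-transcribe a few lines and measure each tool's character error rate (CER) against that transcription. The sketch below is only an illustration: the function names and the idea of passing in each engine's output as a plain string are assumptions, not part of HouseNovel's existing workflow, and it does not call any OCR library itself.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a character from a
                    current[j - 1] + 1,      # insert a character into a
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance divided by the length of the hand-checked transcription."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


def rank_engines(outputs: dict[str, str], ground_truth: str) -> list[tuple[str, float]]:
    """Return (engine name, CER) pairs sorted from most to least accurate.

    `outputs` maps a label you choose for each OCR tool to the text it
    produced for the same sample page (hypothetical input shape).
    """
    scores = {name: character_error_rate(text, ground_truth) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

A lower CER means fewer character-level mistakes; averaging it over several sample pages gives a fairer comparison than a single page.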
Step 2: Identify and Extract Key Information
Once the text is extracted, the script should parse the following:
- 🟡 Occupant Name – First text entry per row
- 🟢 Occupation – Job titles, typically following the name
- 🟣 Street Number – Numeric house number
- 🟠 Street Name – Extracted from the address format
A proposed approach is to use Regular Expressions (RegEx) or another parsing technique to clean and structure the data. If you have experience with natural language processing (NLP), that could be a useful alternative.
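As a rough illustration of the RegEx approach, the sketch below parses one directory line into the four fields listed above. The line layout it assumes (capitalized name words, then lowercase occupation words, then a house number, then the street name) is hypothetical; real 1950 directory pages will likely use abbreviations, prefixes, and multi-line entries that need additional rules. The function name `parse_line` is also just for illustration.

```python
from __future__ import annotations

import re

# Assumed layout (not confirmed by the actual scans):
#   <Name words, capitalized> <occupation words, lowercase> <house number> <street name>
LINE_RE = re.compile(
    r"^(?P<name>(?:[A-Z][A-Za-z'.\-]*\s+)+)"       # e.g. "Anderson Carl J "
    r"(?P<occupation>(?:[a-z][a-z.&\-]*\s+)*)"     # e.g. "carp " (may be empty)
    r"(?P<street_number>\d+)\s+"                   # e.g. "2415"
    r"(?P<street_name>.+)$"                        # e.g. "Lyndale Av S"
)


def parse_line(line: str) -> dict[str, str | None] | None:
    """Parse one OCR'd directory line; return None if it does not fit the pattern."""
    normalized = " ".join(line.split())  # collapse runs of whitespace from OCR
    match = LINE_RE.match(normalized)
    if match is None:
        return None
    occupation = match.group("occupation").strip()
    return {
        "name": match.group("name").strip(),
        "occupation": occupation or None,
        "street_number": match.group("street_number"),
        "street_name": match.group("street_name").strip(),
    }
```

Returning None for lines that do not match makes it easy to collect the rejects and review them by hand, which is usually where the next parsing rule comes from.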
Step 3: Store Data in a Database
Once extracted, the data needs to be stored in a structured format. You will:
- Create a PostgreSQL or MySQL database to store the homeowner records, or recommend a more efficient alternative
- Write a Python script to automate data insertion into the database
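A minimal sketch of the insertion step might look like the following. To keep it self-contained it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL or MySQL; with those databases the driver and placeholder syntax would differ. The table name `occupants`, its columns, and the helper function names are all assumptions made for illustration, not an existing HouseNovel schema.

```python
import sqlite3

# Hypothetical schema: one row per occupant per directory year.
SCHEMA = """
CREATE TABLE IF NOT EXISTS occupants (
    id INTEGER PRIMARY KEY,
    directory_year INTEGER NOT NULL,
    name TEXT NOT NULL,
    occupation TEXT,
    street_number TEXT NOT NULL,
    street_name TEXT NOT NULL
)
"""


def create_schema(conn):
    """Create the (assumed) occupants table if it does not exist yet."""
    conn.execute(SCHEMA)
    conn.commit()


def insert_records(conn, directory_year, records):
    """Insert parsed records (dicts) in one transaction; return the row count."""
    rows = [
        (directory_year, r["name"], r.get("occupation"), r["street_number"], r["street_name"])
        for r in records
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO occupants "
            "(directory_year, name, occupation, street_number, street_name) "
            "VALUES (?, ?, ?, ?, ?)",  # parameterized to avoid SQL injection
            rows,
        )
    return len(rows)


def find_by_address(conn, street_number, street_name):
    """Look up occupants at an address, ignoring case in the street name."""
    cursor = conn.execute(
        "SELECT directory_year, name, occupation FROM occupants "
        "WHERE street_number = ? AND LOWER(street_name) = LOWER(?) "
        "ORDER BY directory_year, name",
        (street_number, street_name),
    )
    return cursor.fetchall()
```

For a quick trial, `sqlite3.connect(":memory:")` gives a throwaway database; the same pattern of one transaction per batch and parameterized queries should carry over to whichever database is chosen.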
Step 4: Integrate Data into HouseNovel Reports
You will work with our PHP developer to:
- Build a “Previous Homeowners” page that displays historical homeowner records
- Implement a search function for users to look up past homeowners by address
- Format the data for easy readability within our automated home history reports
What We’re Looking For
- Current student or recent graduate in computer science, data science, information systems, or a related field
- Experience with Python for text processing and automation (Pandas, SQLAlchemy preferred)
- Familiarity with OCR tools such as Tesseract, Google Vision API, AWS Textract, or similar
- SQL database knowledge (PostgreSQL or MySQL preferred, but open to alternatives)
- Strong problem-solving mindset and willingness to propose efficient solutions
- Ability to work independently and meet deadlines in a remote setting
Time Commitment
We anticipate this project will take approximately 25–40 hours total, spread over 3–5 weeks with a flexible schedule. We will work with you to set milestones and check-ins based on your availability.
Why Join Us?
- Gain real-world experience in data extraction, AI-powered OCR, and database management
- Work on historical data, helping build a home history research platform
- Opportunity for future employment or freelance work if the project expands
- Flexible schedule with remote work and weekly check-ins
How to Apply
Send your resume, a brief cover letter, and any relevant projects (OCR, data extraction, or Python work) via Handshake or by email to amanda.zielike@housenovel.com.
Bonus consideration will be given to candidates who include a small Python script they have written for text parsing or data extraction.