{Project Outcome Abstract}
The goal of the Million Quote Project is to collect 1 million quotes and publish them with pictures.
I will be using agile methodology to quickly get a site up and running, and then refine the site and content over iterations to meet the project objectives.
Remember, the goal of agile is not to deliver a perfect product straight away, but to make steady and meaningful progress, which is more important than perfection.
Here is my take on agile development or design – it ain’t got to be pretty to start with, it just has to work.
And by work, I mean you are shipping bug-free designs. I have seen a lot of agile projects use agile as a way to ship crap and then fix it later… maybe fix it later. Meaning there are a lot of bugs. It has to work, because otherwise you are spending your time fixing things that should have been right in the first place, instead of making the product better.
I like agile because, well, you really don’t know whether what you made will work or not; you are just coming up with theories. And there is that word: “work”. Sure, the product will work, meaning it is bug free and does what it is designed to do… but when it gets in the hands of the user, will it meet some larger, more abstract goal?
Agile also causes you to ship something. Instead of a project being worked on for years, only to launch and find out… well, perhaps we are a day late and a dollar short. Agile allows you to jump in and start testing the waters, seeing what works and what is not working. It causes you to be flexible.
Enough on that. To start the project, I have created some project objectives, which will serve as guideposts.
{Project Objectives}
{Website}
- As fast as possible: pages load in under 3 seconds.
- Will handle over 1M pages of content and a large number of users at the same time.
- Cost Effective Hosting
- SEO Optimized
  - Sitemap is required
  - Human readable URL
  - Images will have ALT tags and Title
  - Other SEO Features as determined
- Requires Search Feature
- Should be easy to expand as needed
- Support Ads if required
- Browse Quotes by topic
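The "Human readable URL" objective above could be met by deriving each page path from the quote itself. Here is a minimal sketch, assuming a hypothetical `/quotes/<author>/<slug>` layout. The `slugify` and `quote_url` names and the eight-word cap are illustrative choices, not decisions made in this project.

```python
import re
import unicodedata


def slugify(text, max_words=8):
    """Turn free text into a lowercase, hyphen-separated URL slug."""
    # Strip accents so "café" becomes "cafe" instead of losing the letter.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop straight apostrophes so "don't" becomes "dont", not "don-t".
    ascii_text = ascii_text.replace("'", "")
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return "-".join(words[:max_words])


def quote_url(author, quote):
    """Build a page path; the /quotes/<author>/<slug> layout is an assumption."""
    return f"/quotes/{slugify(author)}/{slugify(quote)}"
```

With this sketch, `quote_url("Chinese proverb", "A journey of a thousand miles begins with a single step")` returns `/quotes/chinese-proverb/a-journey-of-a-thousand-miles-begins-with`. At a million quotes, two quotes could end up with the same slug, so a real site would probably need to append a short ID as well.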
{Website Design}
- Top Bar - Common to Each Page
  - Logo and Menu
- Home Page
  - {Intro hero banner}
  - {Search}
  - {Browse By Topic}
- About Page
  - Contact Info
  - About The Project
- Quote Pages
  - Each quote will have its own page
  - Along with related media content
  - There will be a side bar
    - Side bar will contain search
    - Listing of topics
    - Related Quotes
  - {Future} Allow for animated GIF or other video format
{Content}
- There should not be duplicate quotes.
- {Organize} Quotes will have a topic assigned
- {Organize} Quotes can be in more than one topic category
- {Organize} Quotes will have an author attribute
- {Quote Images} More than one version
- {Quote Images} Provide images resized for popular Social Media sizes.
- {Quote Images} Contain website branding
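The {Organize} rules above map naturally onto a small relational layout: one table each for authors, quotes, and topics, plus a join table so a quote can sit in more than one topic. This is only a sketch using SQLite from the Python standard library. No database has been chosen for the project yet, and every table and column name here is a placeholder.

```python
import sqlite3

# Hypothetical schema; names are illustrative, not the project's final design.
SCHEMA = """
CREATE TABLE authors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE quotes (
    id        INTEGER PRIMARY KEY,
    text      TEXT NOT NULL UNIQUE,  -- blocks exact duplicate quotes
    author_id INTEGER NOT NULL REFERENCES authors(id)
);
CREATE TABLE topics (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Join table: one quote can belong to many topics.
CREATE TABLE quote_topics (
    quote_id INTEGER NOT NULL REFERENCES quotes(id),
    topic_id INTEGER NOT NULL REFERENCES topics(id),
    PRIMARY KEY (quote_id, topic_id)
);
"""


def create_db(path=":memory:"):
    """Open a SQLite database and create the sketch schema in it."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The `UNIQUE` constraint on `quotes.text` only catches exact matches. Quotes that differ in case or punctuation would still slip through and need a normalization step before insert.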
Million Quote Project – Step One – Figure Out What Needs to Happen
“A journey of a thousand miles begins with a single step” ~Chinese proverb
I need to figure out where I am and where I am going, what I have and what I need. Kind of like starting out on a quest.
What do I have in my bag to start with:
I have a collection of quotes in various formats (xls, csv, and text files). How many quotes? I don’t know. I am sure there are duplicates in there.
Everything else is open road.
First tasks:
- Get the content cleaned and into a database for ease of use
  - No duplicate quotes
- Decide on what and how to host the content.
  - A framework that is fast and easy to expand
  - Cost effective hosting
- Develop the first iteration of the site
  - Home page
  - About page
  - A few (10) quote topics, 50 to 60 quotes max
- Test first iteration
- Publish first iteration
Million Quote Project – Starter Content
Examining the starter quote content:
To start the project, I have 144 files of quotes. Most are in Excel format; others are CSV and flat text files.
Each quote is on its own line.
The files are not all the same: the headers differ, and so does the format. In some, the author name is split across two columns (First, Last); in others it is all in one column.
Out of the pile of content – what do I really need?
What I really care about from the content is – who said it and what was said. If it has the topic associated with the quote – that is great, but some files are missing this completely.
First:
I am going to slice up the content. Sure, I could just import the files directly into a database. But I want to make sure there are no duplicate records.
Also, I want to create a pipeline that will process content into the database without having to reinvent the entire wheel each time.
Since the files are starter content, and I am not even sure I will ever receive files like this again, or even in the same format, I decided to:
- Create a JSON file for each quote in the collection
- Extract only the Quote, Author, and Topic if available.
- Since this is a one off – I will not create an entire process around carving up this content.
I am not going to spend time creating an automatic process that splits out the data, as I am not even sure I will ever receive content like this again. Thus I am viewing it as two separate items for now:
- Getting the data ready for ingesting into the database
- The process of importing records into the database
Once the incoming data is in a standard format, the process of importing records into the database can be built as an automatic process, since everything it works on will be the same, and it can be used over and over.
If I knew I would get some number of csv files a week, all in the same format, then it would make sense to spend the time and define an automatic process around that. But for now, we have a bunch of files that are different.
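To make the second item concrete, here is a rough sketch of what a repeatable import step could look like once every quote sits in its own JSON file. It assumes SQLite as the database and JSON keys named `quote`, `author`, and `topic`. Both are assumptions for illustration, as are the `dedup_key` and `import_quotes` names. Duplicates are caught by hashing a normalized copy of the quote text, so differences in case, punctuation, or spacing do not create a second record.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def dedup_key(quote):
    """Hash the quote with case, punctuation, and extra spacing ignored."""
    kept = "".join(ch for ch in quote.lower() if ch.isalnum() or ch.isspace())
    normalized = " ".join(kept.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def import_quotes(json_dir, conn):
    """Load every *.json file in json_dir; return how many new rows were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quotes ("
        " dedup_key TEXT PRIMARY KEY,"
        " quote TEXT NOT NULL,"
        " author TEXT,"
        " topic TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
    for path in sorted(Path(json_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # INSERT OR IGNORE skips rows whose dedup_key is already present.
        conn.execute(
            "INSERT OR IGNORE INTO quotes VALUES (?, ?, ?, ?)",
            (
                dedup_key(record["quote"]),
                record["quote"],
                record.get("author"),
                record.get("topic"),
            ),
        )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
    return after - before
```

Because the key is the primary key, running the import twice over the same folder adds nothing the second time, which is the property a reusable pipeline needs.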
Million Quote Project – Starter Content – Slicing it up
I looked over the starter content files and made some edits to the headers to make them as alike as possible. I grouped like files into folders.
Now I want to extract the quotes from the files and generate JSON files for the database ingestion process.
To do this, I am using PyCharm in Scientific mode. This allows the use of code cells to write and test code and make sure the results are what you are looking for. It is kind of like Jupyter Notebook, but in an IDE.
To me, this is a great method for extracting data: look at it and make sure it is working like it should before letting it rip through everything. Because automation is great… however, automation can also mess up a lot of things very quickly.
How am I doing this:
I am reading the files into a pandas data structure. From there I am extracting and/or combining the fields based on the format of the input file.
For files that are similar – I am looping over the list of files.
I am generating a UUID for the JSON file – so that I end up with no duplicate file names.
The extracted data is stored in a dictionary and then exported to the JSON file.
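The steps above can be sketched roughly as follows. This is not the actual script. It assumes the headers were already edited down to `Quote`, `Topic`, and either `Author` or `First`/`Last`, and the function names and JSON keys are made up for illustration.

```python
import json
import uuid
from pathlib import Path

import pandas as pd


def _text(value):
    """Return a cell as a string with whitespace collapsed; "" if missing."""
    return "" if pd.isna(value) else " ".join(str(value).split())


def build_records(df):
    """Reduce one input table to a list of {quote, author, topic} dicts."""
    records = []
    for row in df.to_dict("records"):
        quote = _text(row.get("Quote"))
        if not quote:
            continue  # skip rows with no quote text
        if "Author" in row:
            author = _text(row["Author"])
        else:
            # Some files split the name across First and Last columns.
            parts = (_text(row.get("First")), _text(row.get("Last")))
            author = " ".join(part for part in parts if part)
        records.append(
            {"quote": quote, "author": author, "topic": _text(row.get("Topic")) or None}
        )
    return records


def export_records(records, out_dir):
    """Write each record to its own JSON file named with a random UUID."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for record in records:
        # uuid4 file names avoid collisions across input files.
        path = out_dir / f"{uuid.uuid4()}.json"
        path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return len(records)
```

Looping over a folder of like files then comes down to calling `export_records(build_records(pd.read_csv(path)), out_dir)` for each path, or `pd.read_excel` for the Excel files. Trying `build_records` on a single file in a code cell first is an easy way to check the output before running it over everything.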
I now have over 350,000 JSON files (all in the same format) ready for import into the database.