Scrape, Scrape

My book has been getting more and more reviews and ratings on Amazon and Goodreads. Instead of obsessively refreshing those pages, I figured it might be cool to collect reviews from different sources and have them sent to me in a summary email each day.

I thought I’d start with Amazon. I just made a simple little scraping app. It doesn’t actually send anything to me yet; it only scrapes all reviews for a given book ASIN and shows them on a page (I decided this was a good time to also try Vue.js for the first time).

I will list next steps for the little app below, but for now I figured I’d jot down some notes on how I implemented the first very basic scraping of Amazon reviews.

Scraping Amazon: Bad?

Considering I publish my stuff on Amazon, I don’t want to make them mad with my scraping shenanigans… Which is why I checked robots.txt. There, I found that they do not disallow scraping product review URLs. Except for this one very specific, very mysterious bag:

Forbidden ASIN
Screenshot of disallowed ASIN in Amazon's robots.txt

The Mystery Bag
Screenshot of a light brown bag on Amazon

So considering they explicitly disallow this one product and allow everything else, I figure it’s fine. I solemnly swear that I did not scrape the forbidden bag.

Server

I won’t go through the server setup here since the main point is the scraping parts. Suffice it to say I start a server with one handler, which takes a book ASIN and returns all the review and rating info it can find for that book.
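To make that concrete, here’s a rough sketch of what such a handler could look like. The handler name is made up for illustration and the real server setup isn’t shown, but the route and query parameter match what the frontend later in this post calls, and the Harvester it uses is defined in the next section:

func reviewsHandler(h *review.Harvester) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// The frontend sends the ASIN as a query parameter.
		asin := r.URL.Query().Get("asin")
		results, err := h.Harvest(asin)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// The Result structs go back as JSON with their exported field names.
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(results); err != nil {
			logrus.WithError(err).Error("failed to write response")
		}
	}
}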

Harvester

My idea is to eventually have a little toolbox of sorts for writers (well, mostly for me), where people can go to automate the chores they’d otherwise handle by hand in the publishing process. This review collector is just one of those tools. I started by making a review package in my QuillTools repository (tentative name for the site). Inside there, I defined a Harvester.

type Harvester struct {
	scrapers []scraper.Scraper
}

func NewHarvester() *Harvester {
	amzn := scraper.NewAmazon()
	return &Harvester{scrapers: []scraper.Scraper{amzn}}
}

func (h *Harvester) Harvest(asin string) ([]*scraper.Result, error) {
	var results []*scraper.Result
	for _, s := range h.scrapers {
		result, err := s.Scrape(asin)
		if err != nil {
			return nil, fmt.Errorf("failed to scrape result from %s: %w", s.Name(), err)
		}
		results = append(results, result)
	}
	return results, nil
}

As we can see, the harvester keeps a slice of Scrapers. Right now the only scraper I have implemented is Amazon. The scraper package lives under review.

Here is the general structure we’ll return back to the caller from each scraper:

type Source string

const (
	SourceAmazon    Source = "amazon"
	SourceGoodreads Source = "goodreads"
)

type Scraper interface {
	Scrape(id string) (*Result, error)
	Name() Source
}

type Result struct {
	SourceName   Source
	TotalRating  Rating
	RatingsCount int
	Reviews      []Review
}

type Review struct {
	ReviewerName string
	Rating       Rating
	Text         string
	ReviewedAt   time.Time
}

type Rating struct {
	Val float64
	Max int
}

Every scraper will return a Result, which will contain total rating information and all the reviews it was able to find. We have to remember that there can be more ratings than there are reviews. I was originally going to try to return the individual ratings as well, but some sources don’t expose this information, so I thought I’d return the total rating and all the reviews instead.
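A nice side effect of the interface is that adding another source later should just mean implementing Scrape() and Name() and registering the new scraper in the harvester. I haven’t written a Goodreads scraper yet; the stub below is only a sketch of how one would slot in:

// Hypothetical: not implemented yet, just to show how another source plugs in.
type Goodreads struct{}

func (g *Goodreads) Name() Source {
	return SourceGoodreads
}

func (g *Goodreads) Scrape(id string) (*Result, error) {
	// The real scraping logic would go here.
	return nil, errors.New("goodreads scraper not implemented")
}

Once something like that exists, it would just get appended to the scrapers slice in NewHarvester().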

Amazon scraper

This part is very rough: I just hacked around with which elements to look for until it worked, but eventually I’ll need to do a proper pass on this and make sure I’m scraping in the most efficient way.

The Amazon scraper will be defined like this (along with some constants):

const (
	maxAmazonRating  = 5
	defaultAmazonURL = "https://www.amazon.com"
	amazonTimeLayout = "January 2, 2006"
)

type Amazon struct {
	domainURL string
	lock      *sync.RWMutex
}

func NewAmazon() *Amazon {
	return &Amazon{
		domainURL: defaultAmazonURL,
		lock:      &sync.RWMutex{},
	}
}

func (a *Amazon) Name() Source {
	return SourceAmazon
}

I’m doing some async scraping, hence the lock; I’ll also be using httptest to test the external call to Amazon, hence having the domainURL field on the struct instead of just using the const. I wrote a post with more details about that here.

The Scrape() method does all the heavy lifting:

func (a *Amazon) Scrape(asin string) (*Result, error) {
	u, err := url.Parse(a.domainURL)
	if err != nil {
		return nil, fmt.Errorf("failed to parse domain URL: %w", err)
	}
	res, err := a.collectMetadata(u.Host, asin)
	if err != nil {
		return nil, fmt.Errorf("failed to collect metadata: %w", err)
	}
	allReviews, err := a.collectReviews(u.Host, asin)
	if err != nil {
		return nil, fmt.Errorf("failed to collect reviews: %w", err)
	}
	logrus.Infof("Returning %d reviews", len(allReviews))
	res.Reviews = allReviews
	return res, nil
}

I am using colly to handle the scraping. Colly expects an allowed domain to be provided without a scheme, hence the URL parsing at the top.

I started off creating a single Colly collector when instantiating the Amazon scraper and using it for all of the scraping, but later I decided to create a separate collector for the metadata (things like the total rating and rating count) and another one for the individual reviews.

First, it will collect “metadata”, which includes things like the total star rating for the book and the total ratings count:

func (a *Amazon) collectMetadata(domainHost, asin string) (*Result, error) {
	var outerErrors []string
	var totalRatingsCount int
	var totalRating float64

	collector := colly.NewCollector(colly.AllowedDomains(domainHost), colly.Async(true))

	collector.OnHTML("div[data-hook=total-review-count]", func(element *colly.HTMLElement) {
		tc, err := a.getTotalCount(element)
		if err != nil {
			outerErrors = append(outerErrors, err.Error())
			return
		}
		totalRatingsCount = tc
	})

	collector.OnHTML("span[data-hook=rating-out-of-text]", func(element *colly.HTMLElement) {
		r, err := a.getTotalRating(element.Text)
		if err != nil {
			outerErrors = append(outerErrors, err.Error())
			return
		}
		totalRating = r
	})

	u := a.getURL(asin)
	if err := collector.Visit(u); err != nil {
		return nil, fmt.Errorf("failed to scrape Amazon: %w", err)
	}
	collector.Wait()
	if len(outerErrors) != 0 {
		return nil, errors.New(strings.Join(outerErrors, ","))
	}
	return &Result{
		SourceName: SourceAmazon,
		TotalRating: Rating{
			Val: totalRating,
			Max: maxAmazonRating,
		},
		RatingsCount: totalRatingsCount,
	}, nil
}

func (a *Amazon) getTotalCount(element *colly.HTMLElement) (int, error) {
	// The element text looks something like "908 global ratings"; strip the
	// label and surrounding whitespace so only the number remains.
	totalText := element.ChildText("span")
	trimmed := strings.Trim(totalText, "global ratings")
	trimmed = strings.TrimSpace(trimmed)
	count, err := strconv.Atoi(trimmed)
	if err != nil {
		return -1, fmt.Errorf("failed to convert string to int: %w", err)
	}
	return count, nil
}

func (a *Amazon) getTotalRating(text string) (float64, error) {
	re := regexp.MustCompile(`[-+]?\d*\.\d+|\d+`)
	ratingText := re.Find([]byte(text))
	rating, err := strconv.ParseFloat(string(ratingText), 64)
	if err != nil {
		return -1, fmt.Errorf("failed to parse rating text to float: %w", err)
	}
	return rating, nil
}

Above, two functions are registered; they’ll be called when the collector visits a page and finds HTML elements that match the given selectors.

The first parameter in collector.OnHTML() is termed the “selector”. This is a goquery selector. I got the relevant element names by browsing to an Amazon review page and inspecting the elements by hand. One function will try to extract the total number of ratings for the book, and the other will try to extract the total star rating for the book.
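One small piece that isn’t shown in these snippets is the getURL helper that both collectors use to build the page to visit. A minimal sketch, assuming Amazon’s usual product-reviews path for a given ASIN, would be something like:

func (a *Amazon) getURL(asin string) string {
	// Assumes the standard /product-reviews/<ASIN> path on the configured domain.
	return fmt.Sprintf("%s/product-reviews/%s", a.domainURL, asin)
}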

Once that’s done, I collect the reviews themselves:

func (a *Amazon) collectReviews(domainHost, asin string) ([]Review, error) {
	var outerErrors []string
	var allReviews []Review
	collector := colly.NewCollector(colly.AllowedDomains(domainHost), colly.Async(true))
	collector.OnHTML(".a-last", func(element *colly.HTMLElement) {
		if link, ok := element.DOM.Find("a[href]").Attr("href"); ok {
			logrus.WithField("url", link).Infof("next page")
			element.Request.Visit(link)
		}
	})

	collector.OnHTML("div[id=cm_cr-review_list]", func(element *colly.HTMLElement) {
		r, err := a.getAllReviews(element)
		if err != nil {
			a.lock.Lock()
			outerErrors = append(outerErrors, err.Error())
			a.lock.Unlock()
			return
		}
		a.lock.Lock()
		allReviews = append(allReviews, r...)
		a.lock.Unlock()
	})

	u := a.getURL(asin)
	if err := collector.Visit(u); err != nil {
		return nil, fmt.Errorf("failed to scrape Amazon reviews: %w", err)
	}
	collector.Wait()
	if len(outerErrors) != 0 {
		return nil, errors.New(strings.Join(outerErrors, ","))
	}
	return allReviews, nil
}

The above creates a new collector and registers two functions: one which will run when we come across a “Next Page” button, and another which will run when we encounter a review container div.

The first function basically just follows the link to the next review page and visits that, too. Because this is a separate collector from the metadata, the global rating won’t be scraped again: only the reviews and any possible next pages.

The second function takes the review container element and looks for all reviews within it by calling a.getAllReviews(). I might eventually change this to select each individual review div instead, which should allow us to process multiple individual reviews in parallel as opposed to just multiple review containers. That might be faster, on the other hand it would also mean more locking. At some point I’d be curious to run a benchmark test on a few different approaches.
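For reference, that per-review alternative would register the handler on the review divs themselves instead of on the container; the getReview helper below is hypothetical, but the shape would be roughly this:

// Rough sketch of the alternative: let colly hand us each review div
// individually so they can be processed concurrently, at the cost of
// taking the lock around every append.
collector.OnHTML("div[data-hook=review]", func(element *colly.HTMLElement) {
	review, err := a.getReview(element.DOM) // hypothetical single-review parser
	if err != nil {
		a.lock.Lock()
		outerErrors = append(outerErrors, err.Error())
		a.lock.Unlock()
		return
	}
	a.lock.Lock()
	allReviews = append(allReviews, review)
	a.lock.Unlock()
})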

func (a *Amazon) getAllReviews(element *colly.HTMLElement) ([]Review, error) {
	// The reviews are paginated, so this should be a manageable size for now
	var allReviews []Review
	var outerErr error
	reviews := element.DOM.ChildrenFiltered("div[data-hook=review]")
	reviews.Each(func(i int, selection *goquery.Selection) {
		name := selection.Find(".a-profile-name").Text()
		starTxt := selection.Find("i[data-hook=review-star-rating]").Find(".a-icon-alt").Text()
		if starTxt == "" {
			// Fallback
			starTxt = selection.Find("i[data-hook=cmps-review-star-rating]").Find(".a-icon-alt").Text()
		}
		rating, err := a.getTotalRating(starTxt)
		if err != nil {
			outerErr = fmt.Errorf("failed to get total rating: %w", err)
			return
		}

		titleTxt := selection.Find("a[data-hook=review-title]").Find("span").Text()
		titleTxt = strings.TrimSpace(titleTxt)
		reviewBody := selection.Find("span[data-hook=review-body]").Find("span").Text()
		reviewBody = strings.TrimSpace(reviewBody)
		reviewDateTxt := selection.Find(".review-date").Text()
		elements := strings.Split(reviewDateTxt, "on")
		if len(elements) != 2 {
			outerErr = fmt.Errorf("unexpected reviewed-at format: %s", reviewDateTxt)
			return
		}
		reviewDateTxt = strings.TrimSpace(elements[len(elements)-1])

		reviewedAt, err := time.Parse(amazonTimeLayout, reviewDateTxt)
		if err != nil {
			outerErr = fmt.Errorf("failed to parse review time: %w", err)
			return
		}
		review := Review{
			ReviewerName: name,
			Rating: Rating{
				Val: rating,
				Max: maxAmazonRating,
			},
			Text:       fmt.Sprintf("%s. %s", titleTxt, reviewBody),
			ReviewedAt: reviewedAt,
		}
		allReviews = append(allReviews, review)
	})
	return allReviews, outerErr
}

The above iterates over each review child element of the given review container and extracts information about the reviewer’s name, when they left the review, how many stars they gave to the book, the heading of the review, and what they wrote. It concatenates the heading with the rest of the review text for simplicity (eventually I want to use all of the review text to build word clouds and I saw no point in keeping the review heading separate).

I noticed that some review pages use a different data-hook value for the star rating than others, so I had to add a fallback above to capture both.

So we grab the review details from that and then return all the reviews back to the caller. The caller waits for the collector to complete, checks for any errors that may have been encountered during the scraping process, and then returns the final result.

The error handling is very hacky here for now; I’m not yet sure whether I want to fail the entire process when we hit an error, since any error means the overall result could be unreliable.

For example, if I am building word clouds or sending daily review summaries, it would suck to find out after the fact that even one new review got missed.

On the other hand, if I miss one review and scrape 10 successfully, maybe I should go ahead and return the 10. But since this is aimed at new authors like…well, me…who don’t get a ton of reviews, even one missed review would be quite a high percentage. I’ll figure out what I want to do here later.
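If I do end up preferring partial results, one option (just a sketch of the idea, not what the code does today, and the helper names here are made up) would be to skip reviews that fail to parse and report how many were skipped, so the caller can decide whether the result is still trustworthy:

// Sketch of a more lenient getAllReviews: keep whatever parsed cleanly and
// count the failures instead of aborting the whole scrape.
func (a *Amazon) getAllReviewsLenient(element *colly.HTMLElement) ([]Review, int) {
	var allReviews []Review
	skipped := 0
	element.DOM.ChildrenFiltered("div[data-hook=review]").Each(func(i int, selection *goquery.Selection) {
		review, err := a.parseReview(selection) // hypothetical per-review parser
		if err != nil {
			logrus.WithError(err).Warn("skipping review that failed to parse")
			skipped++
			return
		}
		allReviews = append(allReviews, review)
	})
	return allReviews, skipped
}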

Testing

Right now I only have one test, which basically just runs through the entire scraping process above. I saved a page of reviews in my testdata directory, and that’s what the test uses:

func TestAmazonScrape(t *testing.T) {
	testCases := []struct {
		name            string
		retStatus       int
		getRetBody      func(t *testing.T) []byte
		wantErr         error
		wantResult      *Result
		wantReviewCount int
	}{
		{
			name:            "ninefox gambit",
			retStatus:       http.StatusOK,
			wantReviewCount: 10,
			wantResult: &Result{
				TotalRating: Rating{
					Val: 4.1,
					Max: 5,
				},
				Reviews:      nil,
				RatingsCount: 908,
			},
			getRetBody: func(t *testing.T) []byte {
				contents, err := os.ReadFile("testdata/gambit.html")
				require.NoError(t, err)
				return contents
			},
		},
	}

	for _, tc := range testCases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
				w.WriteHeader(tc.retStatus)
				_, err := w.Write(tc.getRetBody(t))
				require.NoError(t, err)
			}))

			defer testServer.Close()

			amzn := NewAmazon()
			amzn.domainURL = testServer.URL

			gotRes, gotErr := amzn.Scrape("1781084491")
			require.ErrorIs(t, gotErr, tc.wantErr)
			require.EqualValues(t, tc.wantResult.RatingsCount, gotRes.RatingsCount)
			require.EqualValues(t, tc.wantResult.TotalRating, gotRes.TotalRating)
			require.Len(t, gotRes.Reviews, tc.wantReviewCount)
		})
	}
}

Frontend

For the frontend, I thought I’d try Vue.js for the first time. It’s not pretty, but it was simple enough to get something rough up and running. I am sure I am not doing a bunch of things right, but I’ll improve it as I learn.

I added a Reviews.vue component:

<template>
  <h3>Collate Reviews</h3>

  <form v-on:submit.prevent="getReviews">
    <input v-model="form.asin" placeholder="ASIN" />
    <button>Submit</button>
  </form>

  <div>
    <h4>Reviews</h4>
    <div v-for="set in reviewSets" :key="set.sourceName">
      <h5>{{set.sourceName}}</h5>
      <p>{{set.rating.val}}/{{set.rating.max}}</p>
      <p>{{set.ratingsCount}} ratings, {{set.reviews.length}} reviews</p>
      <div v-for="rev in set.reviews" :key="rev.reviewerName">
        <p>{{rev.rating}} - {{rev.reviewedAt}}</p>
        <p>{{rev.text}}</p>
        <hr />
      </div>
    </div>
  </div>
</template>

<script>
import axios from 'axios';
import { Review, ReviewSet } from '@/scripts/review.js'

export default {
  name: 'Reviews',
  data(){
    return{
      form: {
        asin: '',
      },
      reviewSets: [],
    }
  },
  methods:{
    getReviews(){
      axios.get('http://localhost:3334/api/reviews',  { params: { asin: this.form.asin } })
        .then((res) => {
          for (let i = 0; i < res.data.length; i++) {
            const set = res.data[i]
            let reviewSet = new ReviewSet(set.SourceName, set.TotalRating.Val, set.TotalRating.Max, set.RatingsCount);
            for (let n = 0; n < set.Reviews.length; n++) {
              const r = set.Reviews[n];
              let review = new Review(r.Rating.Val, r.ReviewerName, r.Text, r.ReviewedAt);
              reviewSet.reviews.push(review)
            }
            this.reviewSets.push(reviewSet);
          }
        })
        .catch((error) => {
          console.error(error)
        })
    }
  }
}

</script>

<style>
  h3 {
    margin-bottom: 5%;
  }
</style>

The above creates a simple text box to input the book’s ASIN. When the user submits the form, I use axios to make a GET request to my Go backend, which does all the scraping and returns the data as JSON; the method then formats it for display. I define the ReviewSet and Review classes in separate helper js files:

export class Review {
    constructor(rating, reviewerName, text, reviewedAt) {
        this.rating = rating;
        this.reviewerName = reviewerName;
        this.text = text;
        this.reviewedAt = reviewedAt;
    }
}

export class ReviewSet {
    constructor(sourceName, ratingVal, ratingMax, ratingsCount) {
        this.reviews = [];
        this.sourceName = sourceName;
        this.rating = {
            val: ratingVal,
            max: ratingMax,
        };
        this.ratingsCount = ratingsCount;
    }
}

This was a kind of rushed hackaround a few evenings ago, and by the time I got to the frontend it was getting pretty late, so to be honest I did not pay much attention here aside from “just get it running” (for now).

As you can see, this doesn’t actually do anything with the scraped reviews yet other than display them.

The Result
Screenshot of scraped review output on an un-styled HTML page

Next Steps

For next steps, I’ll probably do the following:
