Need help stealing from Google Books!

Let me justify the headline first: I only want to download a file that is viewed through Google’s book reader. How did I get here? Well, I wanted the book The Art of Supercell: 10th Anniversary Edition, for free of course. I tried various online sources but came back empty-handed. Then I got down to hacking it myself. I scraped the site’s “read the book” page for data and info, and got links like

https://books.google.co.in/books/publisher/content?id=pyIhEAAAQBAJ&pg=PA54&img=1&zoom=3&hl=en&sig=ACfU3U1qBXr2eX0UErYSYN1gwOyBoFC4fw&w=1280

This is a direct link to page 54 of that book. But this was all a manual attempt: I first deleted a layer above the reader, which exposed the img tag, and then simply copied the src attribute value. Scraping didn’t help much.
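
For reference, the link above can be taken apart with Python's standard library to see which query parameters it carries. This is only a minimal sketch: the helper name `describe_page_link` is made up for illustration, and what each parameter means to the server is a guess from its name, not something confirmed in this thread.

```python
from urllib.parse import urlsplit, parse_qsl

# The page-image link quoted in the post above.
PAGE_LINK = (
    "https://books.google.co.in/books/publisher/content"
    "?id=pyIhEAAAQBAJ&pg=PA54&img=1&zoom=3&hl=en"
    "&sig=ACfU3U1qBXr2eX0UErYSYN1gwOyBoFC4fw&w=1280"
)


def describe_page_link(url):
    """Return the query parameters of a page-image link as a dict.

    Hypothetical helper: it only splits the URL locally, no request is made.
    """
    return dict(parse_qsl(urlsplit(url).query))


params = describe_page_link(PAGE_LINK)
for name, value in params.items():
    print(f"{name} = {value}")
```

The `id` and `pg` values match the book id and page 54 mentioned in the post, and `sig` is the signature value that comes up later in the thread.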

So, does anyone know how to get all the images of a book, or a direct .pdf file, or any other format?
My key idea: get all the images, bundle them into a PDF, and compress them to reduce the size.

Try http://libgen.li/

Just view the png links in the network tab and change the page number parameter. Then write a Python script to download them and merge them into a PDF.
Btw, Google Books generally doesn’t have all the pages.

Sorry for the delay in replying. I already tried that, and it doesn’t work. Plus, it seems to me that the Google Books reader creates keys on demand for the selected book from your signature. That is probably why you can’t view the image: the link must have expired once the session ended.

Dude, 1lib.us is the best one, but if you want Scribd, you can mention me; I sell accounts with a 5-year guarantee for 1.5 dollars.

Pirates don’t pay, and neither do I. :pirate_flag: Plus, I already made a book request there, and they don’t have it.
Thanks, @EYYAD_MOHAMED

Simulate a browser instance using Selenium. It’s easy to do.

Yes, but then when I manually inspected it, it had only loaded up to page 55. I think I first need to grab their key to the reader, then mock the reader to extract the book.

Actually, Selenium is not required at all.
E.g., use the requests lib to get page 15 of the book with id Y7sOAAAAIAAJ using this URL (or the URL in the OP):
https://books.google.co.in/books?id=Y7sOAAAAIAAJ&pg=PP15#v=onepage&q&f=false
Now parse the response and you’ll find a link to the png with the sig parameter.
Repeat for every page with a delay of 5-10 sec.

Yeah, I did that, and the first time I got URLs. But after that there were no URLs in the response.
Every time I scroll the reader, a new page is requested via an XHR request, which shows up under the network tab:

https://books.google.co.in/books?id=pyIhEAAAQBAJ&lpg=PP1&pg=PA14&jscmd=click3

The response to this used to contain the URL of the image, but now it does not. I also tried scraping for “sig” values with BeautifulSoup, using requests and regex in Python; the page only contains links for 5 pages. But by manipulating the above link, I got the links the very first time. The response data looked like this, with “src” holding the link to the image for the specified page:

{"page":[{"pid":"PA14","src":"https://books.google.co.in/books/publisher/content?id=pyIhEAAAQBAJ&pg=PA14&img=1&zoom=3&hl=en&sig=ACfU3U1jB9uozla8Z50cBI2Wwh5bWOnE3Q","flags":0,"order":15,"uf":"https://books.google.co.in/books_feedback?id=pyIhEAAAQBAJ&spid=AFLRE72-doN3cPk5_WmAaPU1hI-MwaCuKlWPYghNChcwB-Hv1Ydp4F7w3dpzB-WLBdhR9BfwYd2p&ftype=0"},{"pid":"PA13","src":"https://books.google.co.in/books/publisher/content?id=pyIhEAAAQBAJ&pg=PA13&img=1&zoom=3&hl=en&sig=ACfU3U1729YW6qKDwGmBNslONxvK6_gMIw"},{"pid":"PA15","src":"https://books.google.co.in/books/publisher/content?id=pyIhEAAAQBAJ&pg=PA15&img=1&zoom=3&hl=en&sig=ACfU3U34AWC8o8OpSHFRVsGGCl7mG1X7Yw"},{"pid":"PA16","src":"https://books.google.co.in/books/publisher/content?id=pyIhEAAAQBAJ&pg=PA16&img=1&zoom=3&hl=en&sig=ACfU3U3i-_-zNGqYtM8wsGe5JYxkj9ELfg"},{"pid":"PA17","src":"https://books.google.co.in/books/publisher/content?id=pyIhEAAAQBAJ&pg=PA17&img=1&zoom=3&hl=en&sig=ACfU3U3KhVwdT8DOyv_3KUqZd-j4mfmheQ"},{"pid":"PP1"},{"pid":"PA2"},{"pid":"PA3"},{"pid":"PA5"},{"pid":"PA9"},{"pid":"PA10"},{"pid":"PA11"},{"pid":"PA12"},{"pid":"PA13"},{"pid":"PA14"},{"pid":"PA15"},{"pid":"PA16"},{"pid":"PA17"},{"pid":"PA18"},{"pid":"PA19"},{"pid":"PA20"},{"pid":"PA21"},{"pid":"PA22"},{"pid":"PA23"},{"pid":"PA24"},{"pid":"PA25"},{"pid":"PA26"},{"pid":"PA27"},{"pid":"PA28"},{"pid":"PA29"},{"pid":"PA30"},{"pid":"PA31"},{"pid":"PA32"},{"pid":"PA33"},{"pid":"PA34"},{"pid":"PA35"},{"pid":"PA36"},{"pid":"PA37"},{"pid":"PA38"},{"pid":"PA39"},{"pid":"PA40"},{"pid":"PA41"},{"pid":"PA42"},{"pid":"PA43"},{"pid":"PA44"},{"pid":"PA45"},{"pid":"PA46"},{"pid":"PA47"},{"pid":"PA48"},{"pid":"PA49"},{"pid":"PA50"},{"pid":"PA54"}]}

Now I’m getting a response like this:

b'{"page":[{"pid":"PA14","flags":8,"order":15},{"pid":"PP1"},{"pid":"PA2"},{"pid":"PA3"},{"pid":"PA5"},{"pid":"PA6"},{"pid":"PA7"},{"pid":"PA8"},{"pid":"PA9"},{"pid":"PA10"},{"pid":"PA11"},{"pid":"PA12"},{"pid":"PA13"},{"pid":"PA14"},{"pid":"PA15"},{"pid":"PA16"},{"pid":"PA17"},{"pid":"PA18"},{"pid":"PA19"},{"pid":"PA20"},{"pid":"PA21"},{"pid":"PA22"},{"pid":"PA23"},{"pid":"PA24"},{"pid":"PA25"},{"pid":"PA26"},{"pid":"PA27"},{"pid":"PA28"},{"pid":"PA29"},{"pid":"PA30"},{"pid":"PA31"},{"pid":"PA32"},{"pid":"PA33"},{"pid":"PA34"},{"pid":"PA35"},{"pid":"PA36"},{"pid":"PA37"},{"pid":"PA38"},{"pid":"PA39"},{"pid":"PA40"},{"pid":"PA41"},{"pid":"PA42"},{"pid":"PA43"},{"pid":"PA44"},{"pid":"PA45"},{"pid":"PA46"},{"pid":"PA47"},{"pid":"PA48"},{"pid":"PA49"},{"pid":"PA50"},{"pid":"PA51"},{"pid":"PA52"},{"pid":"PA53"},{"pid":"PA54"},{"pid":"PA55"},{"pid":"PA56"},{"pid":"PA57"},{"pid":"PA58"},{"pid":"PA59"},{"pid":"PA60"},{"pid":"PA61"},{"pid":"PA62"},{"pid":"PA63"},{"pid":"PA64"},{"pid":"PA65"},{"pid":"PA66"},{"pid":"PA67"},{"pid":"PA68"},{"pid":"PA69"},{"pid":"PA70"},{"pid":"PA71"},{"pid":"PA72"},{"pid":"PA73"},{"pid":"PA74"},{"pid":"PA75"},{"pid":"PA76"},{"pid":"PA77"},{"pid":"PA78"},{"pid":"PA79"},{"pid":"PA80"},{"pid":"PA81"},{"pid":"PA82"},{"pid":"PA83"},{"pid":"PA84"},{"pid":"PA85"},{"pid":"PA86"},{"pid":"PA87"},{"pid":"PA88"},{"pid":"PA89"},{"pid":"PA90"},{"pid":"PA91"},{"pid":"PA92"},{"pid":"PA93"},{"pid":"PA94"},{"pid":"PA95"},{"pid":"PA96"},{"pid":"PA97"},{"pid":"PA98"},{"pid":"PA99"},{"pid":"PA100"},{"pid":"PA101"},{"pid":"PA102"},{"pid":"PA103"},{"pid":"PA104"},{"pid":"PA106"},{"pid":"PA107"},{"pid":"PA108"},{"pid":"PA109"},{"pid":"PA111"},{"pid":"PA112"},{"pid":"PA113"},{"pid":"PA116"},{"pid":"PA117"},{"pid":"PA118"},{"pid":"PA119"},{"pid":"PA121"},{"pid":"PA122"},{"pid":"PA125"},{"pid":"PA126"},{"pid":"PA127"},{"pid":"PA130"},{"pid":"PA132"},{"pid":"PA133"},{"pid":"PA134"},{"pid":"PA135"},{"pid":"PA137"},{"pid":"PA138"},{"pid":"PA140"},{"pid":"PA141"},{"pid":"PA142"},{"pid":"PA143"},{"pid":"PA144"},{"pid":"PA145"},{"pid":"PA146"},{"pid":"PA149"},{"pid":"PA150"},{"pid":"PA151"},{"pid":"PA152"},{"pid":"PA153"},{"pid":"PA154"},{"pid":"PA155"},{"pid":"PA157"},{"pid":"PA158"},{"pid":"PA160"},{"pid":"PA161"},{"pid":"PA162"},{"pid":"PA164"},{"pid":"PA165"},{"pid":"PA166"},{"pid":"PA169"},{"pid":"PA170"},{"pid":"PA171"},{"pid":"PA172"},{"pid":"PA173"},{"pid":"PA176"},{"pid":"PA178"},{"pid":"PA179"},{"pid":"PA180"},{"pid":"PA181"},{"pid":"PA182"},{"pid":"PA183"},{"pid":"PA185"},{"pid":"PA187"},{"pid":"PA188"},{"pid":"PA190"},{"pid":"PA191"},{"pid":"PA192"},{"pid":"PA193"},{"pid":"PA194"},{"pid":"PA195"},{"pid":"PA196"},{"pid":"PA197"},{"pid":"PA198"},{"pid":"PA199"},{"pid":"PA200"},{"pid":"PA201"},{"pid":"PA202"},{"pid":"PA203"},{"pid":"PA204"},{"pid":"PA205"},{"pid":"PA208"},{"pid":"PA209"},{"pid":"PA210"}]}'
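
The difference between the two responses is easier to see when they are parsed instead of read by eye. The sketch below is only an illustration: `summarize` is a made-up helper name, and the two payloads are shortened copies of the responses pasted above (the real ones list far more page ids). It assumes nothing about the response beyond the `page`, `pid` and `src` keys visible in those pastes.

```python
import json

# Shortened copy of the earlier response: some entries carry a "src" link.
FIRST = (
    '{"page":[{"pid":"PA14","src":"https://books.google.co.in/books/publisher/'
    'content?id=pyIhEAAAQBAJ&pg=PA14&img=1&zoom=3&hl=en'
    '&sig=ACfU3U1jB9uozla8Z50cBI2Wwh5bWOnE3Q","flags":0,"order":15},'
    '{"pid":"PA13","src":"https://books.google.co.in/books/publisher/'
    'content?id=pyIhEAAAQBAJ&pg=PA13&img=1&zoom=3&hl=en'
    '&sig=ACfU3U1729YW6qKDwGmBNslONxvK6_gMIw"},'
    '{"pid":"PP1"},{"pid":"PA2"}]}'
)

# Shortened copy of the later response: bytes, and no "src" anywhere.
SECOND = b'{"page":[{"pid":"PA14","flags":8,"order":15},{"pid":"PP1"},{"pid":"PA2"}]}'


def summarize(payload):
    """Split a response into {pid: src} and a list of pids that have no src."""
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    pages = json.loads(payload)["page"]
    linked = {p["pid"]: p["src"] for p in pages if "src" in p}
    missing = []
    for p in pages:
        # A pid can appear twice in the real responses, so de-duplicate.
        if p["pid"] not in linked and p["pid"] not in missing:
            missing.append(p["pid"])
    return linked, missing


linked_first, missing_first = summarize(FIRST)
linked_second, missing_second = summarize(SECOND)
print("first:", sorted(linked_first), "without src:", missing_first)
print("second:", sorted(linked_second), "without src:", missing_second)
```

Besides the missing `src` fields, the pasted responses also differ in the `flags` value of the requested page (0 in the first, 8 in the second); what that flag means is not established in this thread.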

The book’s url:
https://books.google.co.in/books?id=pyIhEAAAQBAJ&pg=PP15#v=onepage&q&f=false

Did you use a user-agent to emulate a real browser?
Even then, I think Google uses JavaScript to tell a real user from a bot. You may have to start a new session each time from a different IP. There are a few libraries that help you do that.

Okay! I’m giving up for now and will take this topic up again after some time. If I’m able to crack it, I’ll let you know. Also, thank you for your support and help. Thanks, brother!