I have written a web crawler that crawls a website by keyword, but I want it to log in to a specific website and then filter information by keyword. How can I achieve that? Here is the code I have so far.

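// DB.java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {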

public Connection conn = null;

public DB() {
    try {
        Class.forName("com.mysql.jdbc.Driver");
        String url = "jdbc:mysql://localhost:3306/test";
        conn = DriverManager.getConnection(url, "root","root");
        System.out.println("conn built");
    } catch (SQLException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
}

public ResultSet runSql(String sql) throws SQLException {
    Statement sta = conn.createStatement();
    return sta.executeQuery(sql);
}

public boolean runSql2(String sql) throws SQLException {
    Statement sta = conn.createStatement();
    return sta.execute(sql);
}

@Override
protected void finalize() throws Throwable {
    // Use && so isClosed() is never called on a null connection
    if (conn != null && !conn.isClosed()) {
        conn.close();
    }
}
}


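// Main.java
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {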
public static DB db = new DB();

public static void main(String[] args) throws SQLException, IOException {
    db.runSql2("TRUNCATE Record;");
    processPage("http://m.naukri.com/login");
}

public static void processPage(String URL) throws SQLException, IOException{
    //check if the given URL is already in database;
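    // a parameterized query, so a URL containing a quote cannot break the SQL
    String sql = "select * from Record where URL = ?";
    PreparedStatement check = db.conn.prepareStatement(sql);
    check.setString(1, URL);
    ResultSet rs = check.executeQuery();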
    if(rs.next()){

    }else{
        //store the URL to database to avoid parsing again
        sql = "INSERT INTO  `test`.`Record` " + "(`URL`) VALUES " + "(?);";
        PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
        stmt.setString(1, URL);
        stmt.execute();

        //get useful information
        Connection.Response res = Jsoup.connect("http://www.naukri.com/").data("username","jeet.chatterjee.88@gmail.com","password","Letmein321")
                 .method(Method.POST)
                    .execute();  
        //http://m.naukri.com/login
        Map<String, String> loginCookies = res.cookies();
        Document doc = Jsoup.connect("http://m.naukri.com/login")
                  .cookies(loginCookies)
                  .get();

        if(doc.text().contains("")){
            System.out.println(URL);
        }

        //get all links and recursively call the processPage method
        Elements questions = doc.select("a[href]");
        for(Element link: questions){
            if(link.attr("abs:href").contains("naukri.com"))
                processPage(link.attr("abs:href"));
        }
    }
}
}

And the table structure:

 CREATE TABLE IF NOT EXISTS `Record` (
 `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
 `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

Now I want the crawler to use my username and password so that it can log in to the site dynamically and crawl information based on a keyword. Let's say my username is lucifer and my password is lucifer123.

lucifer

1 Answer

Your approach assumes stateless web access, which usually works for web services. Websites, however, are stateful: you authenticate once, and after that the site uses the session key stored in your cookie to recognize you (other means of authentication are also possible). So the cookie is required, and you must also send the same parameters that your browser sends. Try monitoring what your browser sends to the site with Firebug, and reproduce that in your code.
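To make the session mechanism concrete, here is a minimal sketch using only the JDK. It assumes the login response carries `Set-Cookie` header values in the usual `name=value; attribute` form. The class `SessionCookies` and the cookie name `PHPSESSID` in the usage note are hypothetical, not something the site is known to use. With Jsoup, `Connection.Response.cookies()` and `Connection.cookies(Map)` already cover this; the sketch only shows what travels over the wire.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper showing how a session cookie is carried between requests. */
final class SessionCookies {

    /** Extracts name/value pairs from the Set-Cookie header values of a login response. */
    static Map<String, String> fromSetCookie(List<String> setCookieValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String value : setCookieValues) {
            // Only the first "name=value" segment is the cookie itself;
            // everything after the first ';' is an attribute (Path, HttpOnly, ...).
            String pair = value.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    /** Builds the Cookie request header that has to accompany every later request. */
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }
}
```

For example, feeding it the two header values `PHPSESSID=abc123; Path=/; HttpOnly` and `lang=en` gives a map with two entries, and `toCookieHeader` turns that map into the single header value to send back on each crawl request.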

--update--

Jsoup.connect("url")
  .cookie("cookie-name", "cookie-value")
  .header("header-name", "header-value")
  .data("data-name","data-value");

You can add multiple cookies, headers, and data values, and there are functions for adding values from a Map.

To find out what must be set, add Firebug to your browser, or use the browser's built-in developer console, which can be opened with F12. Go to the URL you want to get data from and add everything you see there to your Jsoup request. I added some images captured from your site's results.

I marked the important parts in red.
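Once the form fields have been read from the developer console, they have to be sent in the same encoded form the browser uses. The following is a rough sketch, assuming the login form is a plain `application/x-www-form-urlencoded` POST. The class `LoginForm` and the field names in the usage note are hypothetical; the real names are whatever the captured request shows. Jsoup's `.data(...)` performs this encoding itself, so this only illustrates what to compare against the captured request body.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that encodes captured form fields the way a browser form POST does. */
final class LoginForm {

    /** Joins the fields as name=value pairs separated by '&', URL-encoding names and values. */
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

Passing a `LinkedHashMap` keeps the fields in the order they were captured, for instance `username`, `password`, and any hidden fields the form contains. Hidden fields matter: if the browser sends them and the crawler does not, the login may be rejected even with correct credentials.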

You can obtain the required cookies in your code by sending this information to the site and reading the cookies from the response. After getting response.cookies, attach those cookies to every request you make ;)
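The whole flow (log in once, keep the cookies, reuse them on every request, then filter by keyword) could look roughly like the sketch below. It uses the JDK's `java.net.http.HttpClient` (Java 11+) instead of Jsoup so that it has no dependencies. The class `LoggedInFetcher` and its method names are hypothetical, and the login URL and form body have to come from what the browser capture shows; nothing here has been tried against the actual site.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

/** Hypothetical sketch: one client holds the session cookies for the whole crawl. */
final class LoggedInFetcher {

    // The CookieManager stores cookies from responses and sends them back automatically.
    private final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Posts an already URL-encoded form body to the login endpoint and returns the HTTP status. */
    int login(String loginUrl, String formBody) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    /** Fetches a page with the same client, so the session cookies go along with the request. */
    String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Case-insensitive keyword check; note that an empty keyword matches every page. */
    static boolean containsKeyword(String pageText, String keyword) {
        return pageText.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }
}
```

The important design point is that `login` and `fetch` share one client instance. Creating a new client (or, in Jsoup, a new connection without `.cookies(...)`) for each page would drop the session and the site would treat every request as anonymous. The empty-keyword caveat also applies to `doc.text().contains("")` in the question, which is always true.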

P.S.: change your password ASAP.

alizelzele
  • I did not completely get you, can you give me an example? – lucifer Jan 27 '15 at 17:44
  • @alizezele I will try your code, thanks for the reply. BTW, were you able to log in to that site using my credentials? – lucifer Jan 30 '15 at 06:27
  • your login information was all in `Jsoup.connect("http://www.naukri.com/").data("username","jeet.chatterjee.88@gmail.com","password","Letmein321")` so that is not that hard to login ;). by the way i test it again and password is changed(hopefully by you) – alizelzele Jan 30 '15 at 10:45
  • Yes, I am just asking: were you able to log in to the website with my code? I am asking about manual login. – lucifer Jan 30 '15 at 11:40
  • Your code is completely wrong. The URL that checks the login is not the home page; it is `/nlogin/login.php`. I have not tried your code, but I am pretty sure that it won't work ;) – alizelzele Jan 30 '15 at 12:31